R curl::has_internet() FALSE even though there are internet connection - r

My problem arose when downloading data from EuroSTAT using the R package eurostat:
# Population data by NUTS3
pop_data <- subset(eurostat::get_eurostat("demo_r_pjangrp3", time_format = "num"),
(age == "TOTAL") & (sex == "T") &
(nchar(trimws(geo)) == 5))[, c("time","geo","values")]
#Fejl i eurostat::get_eurostat("demo_r_pjangrp3", time_format = "num") :
# You have no internet connection, please reconnect!
Seaching, I have found out that it is the statement (in the eurostat-package code):
if (curl::has_internet() {stop("You have no inernet connection, please connnect") that cause the problem.
However, I have interconnection and can e.g. ping www.eurostat.eu
I have tried curl::has_internet() on different computers, all with internet connection. On some it work (respond TRUE) on others it don't.
I have talked with our IT department, and we tried if it could be a firewall problem. Removing the firewall, did not solve the problem.
Unfortunately, I am ignorant on network-settings. Hence, when trying to read the documentation for the curl-package I am lost.
Downloading data from EuroSTAT using the command above have worked for the last at least 2 years, and for me the problem arose at the start of 2020 (January 7).
Hope someone can help with this, as downloading population data from EuroSTAT is a mandatory part in more of my/our regular work.

In the special case of curl::has_internet, you don't need to modify the function to return a specific value. It has its own enclosing environment, from which it reads a state variable indicating whether a proxy connection exists. You can modify that state variable instead.
assign("has_internet_via_proxy", TRUE, environment(curl::has_internet))
curl::has_internet() # will always be TRUE
# [1] TRUE

It's difficult to tell without knowing your settings but there are a couple of things to try. This issue has been noted and possibly addressed in a development version which you can install with
install.packages("https://github.com/jeroen/curl/archive/master.tar.gz", repos = NULL)
You could also try updating libcurl, which is the C library for which the R package acts as an R interface. The problem you describe seems to be more common with older versions of libcurl.
If all else fails, you could overwrite the curl::has_internet function like this:
remove_has_internet <- function()
{
unlockBinding(sym = "has_internet", asNamespace("curl"))
assign("has_internet", function() return(TRUE), envir = asNamespace("curl"))
lockBinding(sym = "has_internet", asNamespace("curl"))
}
Now if you run remove_has_internet(), any call to curl::has_internet() will return TRUE for the remainder of your R session. However, this will only work if other curl functionality is working properly with your network settings. If it isn't then you will get other strange errors and should abandon this approach.
If, for any reason, you want to restore the functionality of the original curl::has_internet without restarting an R session, you can do this:
restore_has_internet <- function()
{
unlockBinding(sym = "has_internet", asNamespace("curl"))
assign("has_internet",
function() {!is.null(nslookup("r-project.org", error = FALSE))},
envir = asNamespace("curl"))
lockBinding(sym = "has_internet", asNamespace("curl"))
}

I just got into this problem, so here's an additional solution, blending both previous answers. It's reversible and checks if we actually have internet to avoid bigger problems later.
# old value
op = get("has_internet_via_proxy", environment(curl::has_internet))
# check for internet
np = !is.null(curl::nslookup("r-project.org", error = FALSE))
assign("has_internet_via_proxy", np, environment(curl::has_internet))
Within a function, this line can be added to automatically revert the process:
on.exit(assign("has_internet_via_proxy", op, environment(curl::has_internet)))

Related

STRINGdb r environment; error in plot_network

I'm trying to use stringdb in R and i'm getting the following error when i try to plot the network:
Error in if (grepl("The document has moved", res)) { : argument is
of length zero
code:
library(STRINGdb)
#(specify organism)
string_db <- STRINGdb$new( version="10", species=9606, score_threshold=0)
filt_mapped = string_db$map(filt, "GeneID", removeUnmappedRows = TRUE)
head(filt_mapped)
(i have columns titled: GeneID, logFC, FDR, STRING_id with 156 rows)
filt_mapped_hits = filt_mapped$STRING_id
head(filt_mapped_hits)
(156 observations)
string_db$plot_network(filt_mapped_hits, add_link = FALSE)
Error in if (grepl("The document has moved", res)) { : argument is
of length zero
You are using quite few years old version of Bioconductor and by extension the STRING package.
If you update to the newest one, it will work. However the updated package only supports only the latest version STRING (currently version 11), so the underlying network may change a bit.
More detailed reason is this:
The STRING's hardware infrastructure underwent recently major changes which forced a different server setup.
Now all the old calls are forwarded to a different URL, however the cURL call, how it was implemented, does not follow our redirects which breaks the STRINGdb package functionality.
We cannot update the old bioconductor package and our server setup can’t be really changed.
That said, the fix for an old version is relatively simple.
In STRINGdb library there is script with all the methods "rstring.r".
In there you’ll find “get_png” method. In it replace this line:
urlStr = paste("http://string-db.org/version_", version, "/api/image/network", sep="" )
With this line:
urlStr = paste("http://version", version, ".string-db.org/api/image/network", sep="" )
Load the library again and it should create the PNG, as before.

OS-independent way to select directory interactively in R

I would like users to be able to select a directory interactively in R. The solution needs to work on different platforms (at least on Linux, Windows and Mac machines that have a graphical desktop environment). And it needs to be robust enough to work on a variety of computers. I've run into problems with the variants I know of:
file.choose() unfortunately only works for files - It won't allow to select a directory. Other than this limitation, file.choose is a good example of the type of solution I'm looking for - it works across platforms and does not have external dependencies that may not be available on a particular computer.
choose.dir() Only works on Windows.
tk_choose.dir() from library(tcltk) was my preferred solution until recently. But I've had users report that this throws an error
log4cplus:ERROR No appenders could be found for logger (AdSyncNamespace).
log4cplus:ERROR Please initialize the log4cplus system properly.
which we tracked back to Autodesk360 software being installed, which for some reason interferes with tcltk. So this is not a suitable solution unless there is a fix for this. (the only solution I've found by googling is to uninstall Autodesk360, which won't be a solution for users who installed it because they actually use it).
This answer suggests the following as a possible alternative:
library(rJava)
library(rChoiceDialogs)
jchoose.dir()
But, as an example of the sort of thing that can go wrong with this, when I tried to install.packages("rJava") I got:
checking whether JNI programs can be compiled... configure: error:
Cannot compile a simple JNI program. See config.log for details.
Make sure you have Java Development Kit installed and correctly
registered in R. If in doubt, re-run "R CMD javareconf" as root.
ERROR: configuration failed for package ‘rJava’
* removing ‘/home/dominic/R/x86_64-pc-linux-gnu-library/3.3/rJava’ Warning in install.packages : installation of package ‘rJava’ had
non-zero exit status
I managed to fix this on my own machine (linux running openJDK) by installing the openjdk compiler using the linux package manager then running sudo R CMD javareconf. But I can't expect random users with varying levels of computer expertise to have to jump through hoops just so that they can select a directory. Even if they do manage to fix it, it will look bad when every other piece of software they use manages to open a directory-selection dialogue without any problems.
So my question: Is there a robust method that can reliably be expected to "just work" (like file.choose does for files), on a variety of platforms and makes no expectation of the end user being computer literate enough to solve these kinds of issues (such as incompatabilities with Autodesk360 or unresolved Java dependencies)?
In the time since posting this question and an earlier version of this answer, I've managed to test the various options that have been suggested on a range of computers. This process has converged on a fairly simple solution. The only cases I have found where tcltk::tk_choose.dir() fails due to conflicts are on Windows computers running Autodesk software. But on Windows, we have utils::choose.dir available instead. So the answer I am currently running with is:
choose_directory = function(caption = 'Select data directory') {
if (exists('utils::choose.dir')) {
choose.dir(caption = caption)
} else {
tk_choose.dir(caption = caption)
}
}
For completeness, I think it is useful to summarise some of the issues with other approaches and why they do not meet the criteria of being generally robust on a variety of platforms (including robustness against potentially unresolved external dependencies that can't be fixed from within R and that that may require administrator privileges and/or expertise to fix):
easycsv::choose_dir in Linux depends on zenity, which may not be available.
rstudioapi::selectDirectory requires that we are in RStudio Version greater than 1.1.287.
rChoiceDialogs::rchoose.dir requires not only that java runtime environment is installed, but also java compiler must be installed and configured correctly to work with rJava.
utils::menu does not work if the R function is run from the command line, rather than in an interactive session. Also on Linux X11 it frequently leaves an orphan window open after execution, which can't be readily closed.
gWidgets2::gfile has external dependency on either gtk2 or tcltk or Qt. Resolving these dependencies was found to be non-trivial in some cases.
Archived earlier version of this answer
Finally, an earlier version of this answer contained some longer code that tries out several possible solutions to find one that works. Although I have settled on the simple version above, I leave this version archived here in case it proves useful to someone else.
What it tries:
Check whether the function utils::choose.dir exists (will only be available on Windows). If so, use that
Check whether the user is working from within RStudio version 1.1.287 or greater. If so use the RStudio API.
Check if we can load the tcltk package and then open and close a tcltk window without throwing an error. If so, use tcltk.
Check whether we can load gWidgets2 and the RGtk2 widgets. If so, use gWidgets2. I don't try to load the tcltk widgets here, because if they worked, presumably we would already be using the tcltk package. I also do not try to load the Qt widgets, as they seem somewhat unmaintained and are not currently available on CRAN.
Check if we can load rJava and rChoiceDialogs. If so, use rChoiceDialogs.
If none of the above are successful, use a fallback position of requesting the directory name at the console.
Here's the longer version of the code:
# First a helper function to load packages, installing them first if necessary
# Returns logical value for whether successful
ensure_library = function (lib.name){
x = require(lib.name, quietly = TRUE, character.only = TRUE)
if (!x) {
install.packages(lib.name, dependencies = TRUE, quiet = TRUE)
x = require(lib.name, quietly = TRUE, character.only = TRUE)
}
x
}
select_directory_method = function() {
# Tries out a sequence of potential methods for selecting a directory to find one that works
# The fallback default method if nothing else works is to get user input from the console
if (!exists('.dir.method')){ # if we already established the best method, just use that
# otherwise lets try out some options to find the best one that works here
if (exists('utils::choose.dir')) {
.dir.method = 'choose.dir'
} else if (rstudioapi::isAvailable() & rstudioapi::getVersion() > '1.1.287') {
.dir.method = 'RStudioAPI'
ensure_library('rstudioapi')
} else if(ensure_library('tcltk') &
class(try({tt <- tktoplevel(); tkdestroy(tt)}, silent = TRUE)) != "try-error") {
.dir.method = 'tcltk'
} else if (ensure_library('gWidgets2') & ensure_library('RGtk2')) {
.dir.method = 'gWidgets2RGtk2'
} else if (ensure_library('rJava') & ensure_library('rChoiceDialogs')) {
.dir.method = 'rChoiceDialogs'
} else {
.dir.method = 'console'
}
assign('.dir.method', .dir.method, envir = .GlobalEnv) # remember the chosen method for later
}
return(.dir.method)
}
choose_directory = function(method = select_directory_method(), title = 'Select data directory') {
switch (method,
'choose.dir' = choose.dir(caption = title),
'RStudioAPI' = selectDirectory(caption = title),
'tcltk' = tk_choose.dir(caption = title),
'rChoiceDialogs' = rchoose.dir(caption = title),
'gWidgets2RGtk2' = gfile(type = 'selectdir', text = title),
readline('Please enter directory path: ')
)
}
Here is a simple directory navigation menu (using menu{utils}):
d=1
while(d != 0) {
a = getwd()
a = strsplit(a, "/")
a = unlist(a)
b = list.dirs(recursive = F, full.names = F)
c = paste("..", a[length(a) - 1], a[length(a)], sep = "/")
d = menu(c("..", b), title = c, graphics = T)
if(d==1){
e=paste(paste(a[1:(length(a)-1)],collapse = '/',sep = ''),'/',sep = '')
#print(e)
setwd(e)
}else{
e=paste(paste(a,collapse = '/',sep = ''),'/',b[d-1],sep='')
#print(e)
setwd(e)
}
}
Note: I did not (yet) test it under different systems. Here is what the documentation says:
If graphics = TRUE and a windowing system is available (Windows, macOS or X11 via Tcl/Tk) a listbox widget is used, otherwise a text menu. It is an error to use menu in a non-interactive session.
One limitation: The title = can only be a single line.
you can use the choose_dir function from easycsv.
it works on Windows, Linux and OSX
easycsv::choose_dir() # can be run without parameters to prompt a folder selection window
for some use cases a little trick might be to use dirname() around file.choose()
dir <- dirname(file.choose())
this will return the directory. It does however require at least one file to be present in the directory
Suggestion for adaption of choose_directory() as mentioned in my comment (06.09.2018 RFelber):
choose_directory <- function(ini_dir = getwd(),
method = select_directory_method(),
title = 'Select data directory') {
switch(method,
'choose.dir' = choose.dir(default = ini_dir, caption = title),
'RStudioAPI' = selectDirectory(path = ini_dir, caption = title),
'tcltk' = tk_choose.dir(default = ini_dir, caption = title),
'rChoiceDialogs' = rchoose.dir(default = ini_dir, caption = title),
'gWidgets2RGtk2' = gfile(type = 'selectdir', text = title, initial.dir = ini_dir),
readline('Please enter directory path: ')
)
}

testthat fails within devtools::check but works in devtools::test

Is there any way to reproduce the environment which is used by devtools::check?
I have the problem that my tests work with devtools::test() but fail within devtools::check(). My problem is now, how to find the problem. The report of check just prints the last few lines of the error log and I can't find the complete report for the testing.
checking tests ... ERROR
Running the tests in ‘tests/testthat.R’ failed.
Last 13 lines of output:
...
I know that check uses a different environment compared to test but I don't know how I should debug these problems since they are not reproducible at all. Specially these test where running a few month ago, so not sure where to look for the problem.
EDIT
actually I tried to locate my problem and I found a solution. But to post my solution to it, I have to add more details.
So my test always failed since I was testing a markdown script if it is running without errors and afterwards I was checking if some of the environmental variables are set correctly. These where results which I calculate with the script as well as standard settings which I set. So I wanted to get a warning if I forgot to change some of my settings after developing...
Anyway, since it is a markdown script, I had to extract the code and I was using comments from this post knitr: run all chunks in an Rmarkdown document using knitr::purl to get the code and sys.source to execute it.
runAllChunks <- function(rmd, envir=globalenv()){
# as found here https://stackoverflow.com/questions/24753969
tempR <- tempfile(tmpdir = '.', fileext = ".R")
on.exit(unlink(tempR))
knitr::purl(rmd, output=tempR, quiet=TRUE)
sys.source(tempR, envir=envir)
}
For some reason, this produces an error since maybe a few weeks (not sure which new packages I installed lately...). But since there is a new comment, that I can just use knitr::knit which also executes the code, this worked as expected and now my test no longer complains.
So in the end, I don't know where the problem exactly was, but this is now working.
I recently had a similar issue with my tests breaking (succeeding with devtools::test() but failing with devtools::check()). I don't know if this solution necessarily fixes the problem above, but it should help to track down similar problems.
In my case, the problem ultimately came down to using a function that needed a package listed in Suggests rather than in Imports/Depends. In particular, my function called httr::content(), which broke when I tried to pass it the as = "parsed" argument. It turns out that as = "parsed" uses a suggested package, readr to read a csv, and I needed to add it to my dependencies for devtools::check() to work.
This is a known issue with testthat. The workaround is to add the following as the 1st line in tests/testthat.R:
Sys.setenv(R_TESTS="")
In case it helps someone else, this is what worked for me
Re-install all relevant packages. E.g. install.packages("testthat", "dplyr", "lubridate", "stringr") (I included all packages my package uses)
Close RStudio and reopen
Then all tests passed
I spent much too long looking into this error, so hoping that I can help someone out in the future. I would like to add to this that I was getting this error while using ggplot2::autoplot() in my function and it required that I added #import ggfortify to the Roxygen skeleton part of my function.
I ran into the same issue with my tests failing under devtools::check() while not failing under testthat::test()
And none of the above applied to my problem, so i decided to post my issue plus solution here as well. But first some NOTEs from my experience:
devtools::check() does - so it seems - deeper error checking then your own written tests.
Now to my code-setup. I had a function that was build to retrieve values from two different files. Those files contained named profiles with a set of values per profile. But the profiles were named differently, depending on the files:
Example files:
Content of file_one:
[default]
value_A = "foo"
value_B = "bar"
value_C = "baz"
[peter]
value_A = "oof"
value_B = "rab"
value_C = "zab"
content of file_two:
[default]
value_X = "fuzzly"
value_Z = "puzzly"
[profile peter]
value_X = "fuzzly"
value_Z = "puzzly"
As you can see, does the naming in file two follow another naming convention, when it comes to the named profiles. The profiles are written in "[]" and the default-profile is always '[default]' in both files. But as soon as it comes to named profiles, its just '[name]' in one file and then '[profile name]' in the other one.
Now i've build the function like that (simplyfied):
get_value <- function(file_content, what, profile) {
file_content <- readr::read_lines(file)
all_profiles_at <- grep("\\[.*\\]", file_content)
profile_regex <- paste0("\\[",if(file_content == "file_two" && profile != "default") "profile ",profile,"\\]")
profile_at <- grep(profile_regex, file_content)
profile_ends_at <- if(profile_at == max(all_profiles_at)) length(file_content) else all_profiles_at[grep(paste0("^",profile_at,"$"), all_profiles_at) + 1] -1
profile_content <- file_content[profile_at:profile_ends_at]
whole_what <- stringr::str_replace_all(profile_content[grep(paste0("^",what,".*"), profile_content)], " ", "")
return(stringr::str_sub(whole_what, stringr::str_length(paste0(what,"=."))))
}
With this code my tests ran smoothly and even check() found no issues.
While the whole code evolved i figured, that i should read the files content beforehand and give only the alread read_in content to the function to avoid duplication in my code. So i changed the function like so:
get_value <- function(file, what, profile) {
is_file_two <- is_file_two(file_content)
all_profiles_at <- grep("\\[.*\\]", file_content)
profile_regex <- paste0("\\[",if(file_content == "file_two" && profile != "default") "profile ",profile,"\\]")
profile_at <- grep(profile_regex, file_content)
profile_ends_at <- if(profile_at == max(all_profiles_at)) length(file_content) else all_profiles_at[grep(paste0("^",profile_at,"$"), all_profiles_at) + 1] -1
profile_content <- file_content[profile_at:profile_ends_at]
whole_what <- stringr::str_replace_all(profile_content[grep(paste0("^",what,".*"), profile_content)], " ", "")
return(stringr::str_sub(whole_what, stringr::str_length(paste0(what,"=."))))
}
As you might notice i only changed the first line of the funciton body and left the if-condition unchanged - my mistake!
But my tests didn't throw an error, as the if-condition still worked. Even though the 'file_content == "file_two"' part now generated a logical vector and if() ... else ... normally throws a warning, when the logical has length > 1. The special construct with the && doesn't throw such an error as it returns a length(1) logical:
# with warning
if(c(FALSE, FALSE, FALSE)) "Done!" else "Not done!"
# no warning:
if(c(FALSE, FALSE, FALSE) && TRUE) "Done!" else "Not done!"
Thats why my tests with testthat::test() sill worked.
But devtools::check() saw this flaw in my code and the tests failed!
And that part of the FAILURE_REPORT showed me my errors:
[...]
where 41: test_check("my_package_name")
--- value of length: 18 type: logical ---
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE
--- function from context ---
[...]
Conclusion:
testthat::test() is great! Is checks whether or not your code still runs. But devtools::check() goes far deeper - and when your tests pass with testthat::test() but fail with devtools::check() then you've probaly got some deeper bugs and flaws in your code you MUST attend to!
So as I shortly mentioned above, I changed some of my code to no longer user knitr::purl but using knitr::knit and this solved my problem.
expect_error(f <- runAllChunks('010_main_lfq_analysis.Rmd'), NA)
expect_error(f <- knitr::knit('010_main_lfq_analysis.Rmd', output='jnk.R', quiet=TRUE, envir=globalenv()), NA)
This could also happen in the following scenario: You have a library already loaded in R and you are referring to the function in that library without namespace binding. For example, suppose you use the nnzero() function from the Matrix in a test file and happen to also have had the Matrix package already loaded with library(Matrix). Then devtools::test() will pass but devtools::check() fails. Using Matrix::nnzero() should fix the problem.

R: How make dump.frames() include all variables for later post-mortem debugging with debugger()

I have the following code which provokes an error and writes a dump of all frames using dump.frames() as proposed e. g. by Hadley Wickham:
a <- -1
b <- "Hello world!"
bad.function <- function(value)
{
log(value) # the log function may cause an error or warning depending on the value
}
tryCatch( {
a.local.value <- 42
bad.function(a)
bad.function(b)
},
error = function(e)
{
dump.frames(to.file = TRUE)
})
When I restart the R session and load the dump to debug the problem via
load(file = "last.dump.rda")
debugger(last.dump)
I cannot find my variables (a, b, a.local.value) nor my function "bad.function" anywhere in the frames.
This makes the dump nearly worthless to me.
What do I have to do to see all my variables and functions for a decent post-mortem analysis?
The output of debugger is:
> load(file = "last.dump.rda")
> debugger(last.dump)
Message: non-numeric argument to mathematical functionAvailable environments had calls:
1: tryCatch({
a.local.value <- 42
bad.function(a)
bad.function(b)
2: tryCatchList(expr, classes, parentenv, handlers)
3: tryCatchOne(expr, names, parentenv, handlers[[1]])
4: value[[3]](cond)
Enter an environment number, or 0 to exit
Selection:
PS: I am using R3.3.2 with RStudio for debugging.
Update Nov. 20, 2016: Note that it is not an R bug (see answer of Martin Maechler). I did not change my answer for reproducibility. The described work around still applies.
Summary
I think dump.frames(to.file = TRUE) is currently an anti pattern (or probably a bug) in R if you want to debug errors of batch jobs in a new R session.
You should better replace it with
dump.frames()
save.image(file = "last.dump.rda")
or
options(error = quote({dump.frames(); save.image(file = "last.dump.rda")}))
instead of
options(error = dump.frames)
because the global environment (.GlobalEnv = the user workspace you normally create your objects) is included then in the dump while it is missing when you save the dump directly via dump.frames(to.file = TRUE).
Impact analysis
Without the .GlobalEnv you loose important top level objects (and their current values ;-) to understand the behaviour of your code that led to an error!
Especially in case of errors in "non-interactive" R batch jobs you are lost without .GlobalEnv since you can debug only in a newly started (empty) interactive workspace where you then can only access the objects in the call stack frames.
Using the code snippet above you can examine the object values that led to the error in a new R workspace as usual via:
load(file = "last.dump.rda")
debugger(last.dump)
Background
The implementation of dump.frames creates a variable last.dump in the workspace and fills it with the environments of the call stack (sys.frames(). Each environment contains the "local variables" of the called function). Then it saves this variable into a file using save().
The frame stack (call stack) grows with each call of a function, see ?sys.frames:
.GlobalEnv is given number 0 in the list of frames. Each subsequent
function evaluation increases the frame stack by 1 and the [...] environment for evaluation of that function are returned by [...] sys.frame with the appropriate index.
Observe that the .GlobalEnv has the index number 0.
If I now start debugging the dump produced by the code in the question and select the frame 1 (not 0!) I can see a variable parentenv which points (references) the .GlobalEnv:
Browse[1]> environmentName(parentenv)
[1] "R_GlobalEnv"
Hence I believe that sys.frames does not contain the .GlobalEnv and therefore dump.frames(to.file = TRUE) neither since it only stores the sys.frames without all other objects of the .GlobalEnv.
Maybe I am wrong, but this looks like an unwanted effect or even a bug.
Discussions welcome!
References
https://cran.r-project.org/doc/manuals/R-exts.pdf
Excerpt from section 4.2 Debugging R code (page 96):
Because last.dump can be looked at later or even in another R session,
post-mortem debug- ging is possible even for batch usage of R. We do
need to arrange for the dump to be saved: this can be done either
using the command-line flag
--save to save the workspace at the end of the run, or via a setting such as
options(error = quote({dump.frames(to.file=TRUE); q()}))
Note that it is often more productive to work with the R Core team rather than just telling that R has a bug. It clearly has no bug, here, as it behaves exactly as documented.
Also there is no problem if you work interactively, as you have full access to your workspace (which may be LARGE) there, so the problem applies only to batch jobs (as you've mentioned).
What we rather have here is a missing feature and feature requests (and bug reports!) should happen on the R bug site (aka _'R bugzilla'), https://bugs.r-project.org/ ... typically however after having read the corresponding page on the R website: https://www.r-project.org/bugs.html.
Note that R bugzilla is searchable, and in the present case, you'd pretty quickly find that Andreas Kersting made a nice proposal (namely as a wish, rather than claiming a bug),
https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17116
and consequently I had added the missing feature to R, on Aug.16, already.
Yes, of course, the development version of R, aka R-devel.
See also today's thread on the R-devel mailing list,
https://stat.ethz.ch/pipermail/r-devel/2016-November/073378.html

Package that downloads data from the internet during installation

Is anyone aware of a package that downloads a dataset from the internet during the installation process and then prepares and saves it so that it is available when loading the package using library(packageName)? Are there any drawbacks in this approach (besides the obvious one that package installation will fail if the data source is unavailable or the data format has changed)?
EDIT: Some background. The data is three tab-separated files in a ZIP archive, owned by federal statistics and generally freely accessible. I have R code which downloads, extracts and prepares the data, in the end three data frames are created which could be saved in .RData format.
I am thinking about creating two packages: A "data" package that provides the data, and a "code" package that operates on it.
I did this mockup before, while you were posting your edit. I presume it would work, but not tested. I've commented it so you can see what you would need to change. The idea here is to check to see if an expected object is available in the current working environment. If it is not, check to see that the file that the data can be found in is in the current working directory. If that is not found, prompt the user to download the file, then proceed from there.
myFunction <- function(this, that, dataset) {
# We're giving the user a chance to specify the dataset.
# Maybe they have already downloaded it and saved it.
if (is.null(dataset)) {
# Check to see if the object is already in the workspace.
# If it is not, check to see whether the .RData file that
# contains the object is in the current working directory.
if (!exists("OBJECTNAME", where = 1)) {
if (isTRUE(list.files(
pattern = "^DATAFILE.RData$") == "DATAFILE.RData")) {
load("DATAFILE.RData")
# If neither of those are successful, prompt the user
# to download the dataset.
} else {
ans = readline(
"DATAFILE.RData dataset not found in working directory.
OBJECTNAME object not found in workspace. \n
Download and load the dataset now? (y/n) ")
if (ans != "y")
return(invisible())
# I usually use RCurl in case the URL is https
require(RCurl)
baseURL = c("http://some/base/url/")
# Here, we actually download the data
temp = getBinaryURL(paste0(baseURL, "DATAFILE.RData"))
# Here we load the data
load(rawConnection(temp), envir=.GlobalEnv)
message("OBJECTNAME data downloaded from \n",
paste0(baseURL, "DATAFILE.RData \n"),
"and added to your workspace\n\n")
rm(temp, baseURL)
}
}
dataset <- OBJECTNAME
}
TEMP <- dataset
## Other fun stuff with TEMP, this, and that.
}
Two packages, hosted at Github
Here's another approach, building on the comments between #juba and I. The basic concept is to have, as you describe, one package for the codes and one for the data. This function would be part of the package that contains your code. It will:
Check to see if the data package is installed
Check to see if the version of the data package you have installed matches the version at Github, which we are going to assume is the most up to date version.
When it fails any of the checks, it asks the user if they want to update their installation of the package. In this case, for demonstration, I've linked to one of my packages in progress at Github. This should give you an idea of what you need to substitute to get it to work with your own package once you've hosted it there.
CheckVersionFirst <- function() {
# Check to see if installed
if (!"StataDCTutils" %in% installed.packages()[, 1]) {
Checks <- "Failed"
} else {
# Compare version numbers
require(RCurl)
temp <- getURL("https://raw.github.com/mrdwab/StataDCTutils/master/DESCRIPTION")
CurrentVersion <- gsub("^\\s|\\s$", "",
gsub(".*Version:(.*)\\nDate.*", "\\1", temp))
if (packageVersion("StataDCTutils") == CurrentVersion) {
Checks <- "Passed"
}
if (packageVersion("StataDCTutils") < CurrentVersion) {
Checks <- "Failed"
}
}
switch(
Checks,
Passed = { message("Everything looks OK! Proceeding!") },
Failed = {
ans = readline(
"'StataDCTutils is either outdated or not installed. Update now? (y/n) ")
if (ans != "y")
return(invisible())
require(devtools)
install_github("StataDCTutils", "mrdwab")
})
# Some cool things you want to do after you are sure the data is there
}
Try it out with CheckVersionFirst().
Note: This would succeed only if you religiously remember to update your version number in your description file every time you push a new version of the data to Github!
So, to clarify/recap/expand, the basic idea would be to:
Periodically push the updated version of your data package to Github, being sure to change the version number of the data package in its DESCRIPTION file when you do so.
Integrate this CheckVersionFirst() function as an .onLoad event in your code package. (Obviously modify the function to match your account and package name).
Change the commented line that reads # Some cool things you want to do after you are sure the data is there to reflect the cool things you actually want to do, which would probably start with library(YOURDATAPACKAGE) to load the data....
This method may not be efficient, but a good workaround. If you are making a package that needs regularly updated data, first make a package which has that data. It does not need any functions, but I like the concept of a setter (which you might not need in this case) & getter.
Then when you make your package, have the 'data'-package as a dependency. This way, whenever someone installs your package, he/she will always have the latest data.
On your part, you'll just have to swap out the data in your 'data' package, and upload it to the repo you want.
If you don't know how to build a package, check ?packages.skeleton and R CMD CHECK, R CMD BUILD

Resources