R package with code that only runs once (per installation)

I'd like to create an R package that, upon installation, displays contact information for the maintainer and asks the user for permission to count them in our list of installations. It would also be acceptable to have the code run the first time the user calls one of our functions, instead of immediately on installation. Either way, this message should only appear once ever (unless the user reinstalls / updates the package).
What I've considered:
I know how to include a dataset for internal use, but I don't know how to change that data permanently.
We could set an environment variable / app setting, but I don't know if there's a way to make that persist after the end of the session.
Using an external service / server would be excessively heavyweight, and wouldn't allow users who don't want to be tracked to turn off the message.
Is there a good way to do this?

This can run more than once but only within a limited time window so perhaps it is good enough.
Add this code to your package and it will issue the message any time the package is loaded within 7 days of installation; thereafter it will not issue the message again until the package is updated.
It works by comparing the time the install files were created to the current time. It does not require write permissions to any directory, only read, so it should work generally.
.onLoad <- function(libname, pkgname) {
  ctime <- file.info(find.package(pkgname, libname))$ctime
  if (difftime(Sys.time(), ctime, units = "days") < 7)
    packageStartupMessage("This msg will go away one week after installing this package")
}

You may have to bite the bullet and store state information across sessions to show it once and only once.
Some packages which may help:
settings which retrieves user configuration settings
config which retrieves configuration information
httr which accesses config info
registry which offers a registry
pkgconfig which offers private configuration.
but I am not sure which one reads and writes. Maybe the last one fits the bill.
Edit: Turns out that even pkgconfig does not persist values across sessions. I have solved this problem with company-local code when I had control over directories or databases to write. For public and portable code it is a little harder. I still think there is a package out there that stores user-level config on all major OSs but I cannot for now remember the name.
Edit 2: With a nod to Gabor Csardi for refreshing my memory, the rappdirs package solves the problem of portably supplying a per-user config location (with other tricks too; it is a port of a corresponding Python library). Combine this with a simple csv or rds file to store when (if at all) you last showed the message, and you can now show it once and exactly once. Not even again after a package upgrade.
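For illustration, a minimal sketch of that idea, assuming the package is called mypkg and using an rds file as the marker (all names here are placeholders):

show_startup_message_once <- function() {
  cfg_dir <- rappdirs::user_config_dir("mypkg")      # per-user, persistent config directory
  marker  <- file.path(cfg_dir, "message_shown.rds") # records whether/when we showed the message
  if (!file.exists(marker)) {
    packageStartupMessage("Maintainer contact: ... May we count your installation?")
    dir.create(cfg_dir, recursive = TRUE, showWarnings = FALSE)
    saveRDS(Sys.time(), marker)                      # persists across sessions and upgrades
  }
}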

The following code allows you to create a file in the package library:
activate_file = paste(system.file('extdata', package = 'your_package'), 'activated.txt', sep = '/')
file.exists(activate_file)
# FALSE
file.create(activate_file)
file.exists(activate_file)
# TRUE
Now you can check in .onLoad whether or not the activated.txt file exists. The first time, you show the message and then create activated.txt; the next time the package is loaded, .onLoad sees the file and skips the message (see the sketch below).
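A rough sketch of that .onLoad logic (assuming the package ships an inst/extdata directory, so system.file() resolves to a real path):

.onLoad <- function(libname, pkgname) {
  activate_file <- file.path(system.file("extdata", package = pkgname), "activated.txt")
  if (!file.exists(activate_file)) {
    packageStartupMessage("This message is shown only once per installation.")
    file.create(activate_file)   # mark the message as shown for later sessions
  }
}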
Advantages:
Persistent over sessions.
Platform independent, and since the user was able to install the package into that library, they should have write privileges to create the file there.
Disadvantages:
Reinstall/upgrade wipes the activated file, thus showing the message again.
If this is not acceptable, you could try and find a persistent location, e.g. in the home directory (e.g. ~/.your_package/activated.txt). Then the challenge is to make this platform independent. Maybe look at path.expand("~") to get the current user's home directory; I'm not sure if this works on Windows.

Related

Using Renv behind a proxy without password in plaintext

I'm working on R projects behind a proxy server, which is why I use the keyring library to store my proxy credentials and to authenticate on the proxy manually whenever it is required. This way, I don't need to write HTTPS_PROXY=http://usr:pw@proxy:port somewhere in plaintext - neither in global environments nor project-wise. Of course, at runtime the session environment does contain this string, but at least only for that session.
So far so good. Now I need to use virtual environments because of some package version mismatches in my projects. For that I ran renv::init(). After closing and reopening the project, RStudio seems to freeze while loading it. I guess renv somehow tries to reach the packages (some are on CRAN, some are on a local GitLab), which cannot work as the proxy is not set.
When I create a .Renviron including the proxy settings with my username and password, everything works fine.
Do you know a way to prevent renv from trying to connect to the package sources at project start? Or do you think the problem lies somewhere else?
My best guess is that renv is trying to make a request to the active package repositories on startup, as part of its attempt to verify that the lockfile + library are in sync. If that's the case, you could disable this via:
RENV_CONFIG_SYNCHRONIZED_CHECK = FALSE
in your .Renviron. See https://rstudio.github.io/renv/reference/config.html for more details.
Alternatively, you could tell renv to load your credentials earlier on in a couple ways, e.g.
Try adding the initialization work to either your project .Rprofile or (with RENV_CONFIG_USER_PROFILE = TRUE) your user ~/.Rprofile;
Try adding the initialization code to a file located at renv/settings.R, which renv will source relatively early on during load.
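For example, renv/settings.R could pull the credentials from keyring and export the proxy variable before renv contacts any repository (a sketch; the keyring service names, proxy host, and port are placeholders):

# renv/settings.R -- sourced by renv early during project load
user <- keyring::key_get("proxy_user")
pw   <- keyring::key_get("proxy_password")
Sys.setenv(HTTPS_PROXY = sprintf("http://%s:%s@proxy.example.com:8080", user, pw))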

Temporarily Stop R From Recording History

I find it very useful, in general, for my R history to be saved. I refer to it weekly, or more. Exploratory work in the console gradually gets refined and added to a file.
Occasionally a command will have a secret in it like an API key or searching a dataframe with confidential info, in which case I would like to be able to disable history being saved just for that one command and re-enable it immediately after. Something as much like bash's ignorespace option as possible.
I would be happy for a solution that worked in either R or RStudio, both would be even better. I know history can be manually disabled by going to Tools > Options > General > Always save history but I'm looking for either a command (or keyboard shortcut) so it can be turned on or off quickly.
Edit: something I thought might work but seemed not to help at all was setting "R_HISTFILE" to FALSE or a non-existent file. It doesn't seem to help the RStudio history, at least. My examination of what it actually did has not been very thorough yet.
As I stated in the comment, there are ways to avoid having an API key stored in the history file. As the comment seems to have collected some upvotes, it might be worth the effort to expand it into an answer.
Occasionally a command will have a secret in it like an API key or searching a dataframe with confidential info, in which case I would like to be able to disable history being saved just for that one command and re-enable it immediately after.
I think right now it is only possible to find a solution for the "API key issue" with the current version of RStudio; see the comments in the links under the paragraph "Concerning the confidential info".
However, while waiting for a solution, this page could be of interest for you: https://cran.r-project.org/web/packages/httr/vignettes/secrets.html.
Avoiding storing the API key is easier than hiding confidential info from a data.frame, I think.
Concerning the confidential info:
Longer to introduce, but "clean":
I think it's worth adding it as a feature request for the great rstudioapi package, or adding it here:
https://support.rstudio.com/hc/en-us/community/posts/115000932128-RStudio-Config-Files
Related: https://github.com/rstudio/rstudio/issues/1607 (would enable user to write their own addin)
Related: https://community.rstudio.com/t/configure-rstudio-global-options-on-install/14881 (would enable user to write their own addin)
Fast to introduce, but dirty:
A hacky, dirty workaround would be to introduce an add-in that deletes the last entry in the history file.
Information storage
Here is described where the settings are stored: https://support.rstudio.com/hc/en-us/articles/200534577-Resetting-RStudio-Desktop-s-State.
You can navigate to the RStudio-Desktop folder, e.g. on Windows by entering %localappdata%\RStudio-Desktop in the explorer.
The global options you are looking for can be found here: ..\monitored\user-settings\user-settings.
The flag "always save history,..." in Rstudio - Tools - Global Options - General is the first value in ..\monitored\user-settings\user-settings.
Unfortunately, RStudio won´t listen on changes in that file, so you would have to restart RStudio to make changes be visible. So for now it is not an option for temporarily stopping Rstudio from recording the history.
Concerning the API key, let me summarize a few approaches from that page:
Add a "popup" to ask for the secret: rstudioapi::askForPassword()
Use environment variables. You avoid the popup, but I think it just moves the confidential info from the history to an environment variable.
Finally, see the keyring package for storing the data in the secret store of your OS (see the sketch below).
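A small sketch of the keyring approach (the service name "my_api" is a placeholder):

library(keyring)
key_set("my_api")             # prompts once and stores the secret in the OS credential store
api_key <- key_get("my_api")  # retrieve it later without the key ever touching .Rhistory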

update of an R package which is lazy loaded

I have several unix servers using a R package which is installed on a shared R library folder. The packages are lazy loaded (that's the default) from this shared folder.
Now I want to update the package:
1) is it possible (and clean) to do that without closing all R instances?
2) More precisely, I am concerned about the following:
2)a) The warning I get from the user interface when I try to install a package that is already loaded:
2)b)
From https://cran.r-project.org/doc/manuals/r-release/R-ints.html#Lazy-loading,
When a package/namespace which uses it is loaded, the package/namespace environment is populated with promises for all the named objects: when these promises are evaluated they load the actual code from a database.
Does that mean that the R instance will read from the library folder again when doing the actual evaluation of each object (in which case I need to either deactivate lazy loading, or close all R instances before updating the package)?
3) is there an alternative way to maintain R packages on a network of servers that are running scripts all the time, without having to put each server offline one by one?
Thanks for your input
You asked
1) is it possible (and clean) to do that without closing all R instances?
and I can assure you that yes, it is possible; this is how it works and is done everywhere.
As for
2) More precisely, I am concerned about the following:
you are reading it wrong. An R restart is simply recommended to ensure the new package is loaded, as you cannot insert it into a running session.
Further
3) is there an alternative way to maintain R packages on a network of servers, that are running scripts all the time, without having to put each server offline one by one)
you never have to take a server off-line just to update a user-space package. E.g. we don't even take them off-line when we, say, upgrade the entire Ubuntu release twice a year.
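In practice, updating a user-space package in the shared library is just a regular install into that library path (a sketch; the path and repository are placeholders):

install.packages("yourpackage",
                 lib   = "/shared/R/library",            # the shared library folder
                 repos = "https://cloud.r-project.org")
# Sessions that have already loaded the package keep running as they are; new
# sessions (or sessions that re-attach the package) pick up the updated version.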

Is there a persistent location that is always writable which can be used as data cache by a package?

Is there a predefined location where an R package could store cached data? The data should persist across sessions. I was thinking about creating a subdirectory of ${R_LIBS_USER}/package_name, but I'm not sure if this is portable and if this is "allowed" if my package is installed systemwide.
The idea is the following: Create an R script mydata.R in the data subdirectory of the package which would be executed by calling data(mydata) (according to the documentation of data()). This script would load the data from the internet and cache it, if it hasn't been cached before. (If the data has been cached already, the cache will be used.) In addition, a function will be provided to invalidate the cache and/or to check if a newer version of the data is available online.
This is from the documentation of data():
Currently, four formats of data files are supported:
files ending ‘.R’ or ‘.r’ are source()d in, with the R working directory changed temporarily to the directory containing the respective file. (data ensures that the utils package is attached, in case it had been run via utils::data.)
...
Indeed, creating a file fortytwo.R in the data subdirectory of a package with the following contents:
fortytwo = data.frame(answer=42)
and then executing data(fortytwo) creates a data frame variable fortytwo. Now the question is: Where would fortytwo.R cache the data if it were difficult to compute?
EDIT: I am thinking about creating two packages: A "data" package that provides the data, and a "code" package that operates on it. The question concerns the "data" package: Where can it store files in a per-user storage so that it is persistent across R sessions and is accessible from different R projects?
Related: Package that downloads data from the internet during installation.
There is no absolutely defined location for package-specific persistent caching in R. However, the R.cache package provides an interface for creating and managing cached data. It looks like it could be useful for your scenario.
When users load R.cache (library(R.cache)), they get the following prompt:
The R.cache package needs to create a directory that will hold cache files.
It is convenient to use one in the user's home directory, because it remains
also after restarting R. Do you wish to create the '~/.Rcache/' directory? If
not, a temporary directory (/tmp/RtmpqdUcbP/.Rcache) that is specific to this
R session will be used. [Y/n]:
They can then choose to create the cache directory in their home directory, which is presumably persistent, or to create a session-specific directory. If you make your data package depend on R.cache, you could check for the existence of the cached object(s) in its .onLoad() hook function and download the data if it isn't there. Alternatively, you could do this in the way suggested in your own question.
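A sketch of how the data package might use R.cache (download_mydata() is a hypothetical function that fetches the data from the internet):

library(R.cache)

get_mydata <- function() {
  key <- list("mydata")                # identifies the cached object
  data <- loadCache(key)
  if (!is.null(data)) return(data)     # cache hit: reuse across sessions
  data <- download_mydata()            # hypothetical download step
  saveCache(data, key = key)           # persists in the R.cache directory
  data
}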
Have you looked at in-memory databases? H2 and Redis have bindings in R via RH2 and rredis; both allow you to share the data across R sessions, as long as the creating session is alive. In order to have it persist across non-concurrent sessions, you need to write your data to disk (assuming you can't re-create it on the fly - that would defeat the purpose of this question), and I believe the data package would be a good option. That way, you could add an update function that initializes every time you load either package (i.e. if the code package has the right dependencies).
An example is RWeka & RWekaJars packages. Look them up on CRAN, and it should be fairly easy to understand how they work.

monitoring for changes in file(s) in real time

I have a program that monitors certain files for change. As soon as a file gets updated, it is processed. So far I've come up with this general approach of handling "real time analysis" in R. I was hoping you guys have other approaches. Maybe we can discuss their advantages/disadvantages.
file_to_watch <- "path/to/file"   # file being monitored (placeholder)
sleep.time <- 5                   # seconds to wait between checks
monitor <- TRUE
start.state <- file.info(file_to_watch)$mtime  # modification time of the file when initiating

while (monitor) {
  change.state <- file.info(file_to_watch)$mtime
  if (start.state < change.state) {
    # process the updated file
    start.state <- change.state   # reset the baseline so each update is processed only once
  } else {
    print("Nothing new.")
  }
  Sys.sleep(sleep.time)
}
Similar to the suggestion to use a system API, this can be also done using qtbase which will be a cross-platform means from within R:
dir_to_watch <- "/tmp"
library(qtbase)
fsw <- Qt$QFileSystemWatcher()
fsw$addPath(dir_to_watch)
id <- qconnect(fsw, "directoryChanged", function(path) {
  message(sprintf("directory %s has changed", path))
})
cat("abc", file="/tmp/deleteme.txt")
If your system provides an API for monitoring filesystem changes, then you should use that. I believe Macs come with this. Not sure about other platforms though.
Edit:
A quick Google search gave me:
Linux - http://wiki.linuxquestions.org/wiki/FAM
Win32 - http://msdn.microsoft.com/en-us/library/aa364417(VS.85).aspx
Obviously, these APIs will eliminate any polling that you require. On the other hand, they may not always be available.
Java has this: http://jnotify.sourceforge.net/ and http://java.sun.com/developer/technicalArticles/javase/nio/#6
I have a hack in mind: you can set up a cron job / scheduled task to run an R script every n seconds (or whatever). The R script checks the file hash, and if the hashes don't match, runs the analysis. You can use the digest::digest() function; just check out the manual.
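A sketch of that hash check (process_file() is a hypothetical function doing the actual analysis; the paths are placeholders):

library(digest)

watched_file <- "data/input.csv"
hash_store   <- "last_hash.rds"

old_hash <- if (file.exists(hash_store)) readRDS(hash_store) else ""
new_hash <- digest(watched_file, file = TRUE)   # hash the file contents

if (!identical(old_hash, new_hash)) {
  process_file(watched_file)                    # hypothetical processing step
  saveRDS(new_hash, hash_store)                 # remember the hash for the next run
}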
If you have lots of files that you want to monitor, then R may be too slow for this purpose. Go to your c: or / dir and see how long it takes to do file.info(dir(recursive = TRUE)). A dos or bash script may be quicker.
Otherwise, the code looks fine.
You could use the tclTaskSchedule function in the tcltk2 package to set up a function that checks for updates and runs your code. This would then be run on a regular basis (you set the timing) but would still allow you to use your R session.
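For instance (check_for_updates() stands in for your own checking function; the 5000 ms interval is arbitrary):

library(tcltk2)
tclTaskSchedule(5000, check_for_updates(), id = "fileWatcher", redo = TRUE)
# ...and to stop watching later:
# tclTaskDelete("fileWatcher")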
I'll offer another solution for Windows that I have been using in a production environment, that works perfectly and that I find very easy to set up. Under the hood it basically accesses the system API for monitoring folder changes, as others have mentioned, but all the "hard work" is taken care of for you. I use a freely available piece of software called Folder Monitor by Nodesoft, well described here. Once you execute this program, it appears in your system tray and from there you can specify a given directory to monitor. When files are written to the directory (or changed or modified - there are a few options from which you can choose), the program executes any program you like.
I simply link the program to a Windows batch file that calls my R script. So for example, I have Folder Monitor set up to monitor a "\myservername\DropOff" UNC path for any new data files written to it. When Folder Monitor detects new files, it executes a RunBatch.bat file that simply runs an R script (see here for information on setting that up) that validates the format of the expected file based on an expected naming convention, then unzips and processes the data, creating a dataframe and ultimately loading that into a SQL Server database. It just doesn't get any easier.
One note if you decide to use this solution: take a look at the optional delay execution parameter, which might be important if files take a while to copy into the target directory from the source location.