R Temporary Directory Set to External Drive

When running the RecordLinkage package in R on a large dataset, the GUI failed and closed down.
I now realize that, as a result of R's activity, 120GB of data (in .ff files) had been written to my Windows temporary folder, filling my hard drive to its limit.
I would like to plug into an external drive with more space, and set the temporary directory for R to use there.
Can I do this in R, before running my analysis? What is the command?
Is there another way around this problem I'm not thinking about? Thanks kindly.

If you are generating *.ff files, you appear to be making use of the ff package.
Assuming this is true, you should be able to set the fftempdir option as follows:
library(ff)
options(fftempdir = "/EnterYourFilePathHere/")
Just replace EnterYourFilePathHere with a path on your external HDD.
You should read more about the ff package and the fftempdir in the package documentation: http://cran.r-project.org/web/packages/ff/index.html
The package handles the temp files differently (e.g. whether and when they are deleted) depending on whether it takes fftempdir from your working directory (i.e. getwd()) or from the fftempdir option.
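For instance, a quick sanity check along these lines should confirm that new .ff backing files land on the external drive before the RecordLinkage run is started (E:/R_temp is just a placeholder for a folder on the external drive):
library(ff)
options(fftempdir = "E:/R_temp")          # placeholder path on the external drive
getOption("fftempdir")                    # confirm the option took effect
x <- ff(vmode = "double", length = 10)    # small throwaway ff object
filename(x)                               # its backing file should sit under E:/R_temp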

Related

How can I use multiple library paths?

I'm trying to set up an easy-to-use R development environment for multiple users. R is installed along with a set of other dev tools on an NFS mount.
I want to create a core set of R packages that also live on NFS so n users don't need to install their own copies of the same packages n times. Then, I was hoping users could install one-off packages into a local R library. Has anyone worked with an R setup like this before? From the docs, it looks doable by adding both the core package path and the personal package path to .libPaths().
You want to use the .Renviron file (see ?Startup).
There are three places to put the file:
Site wide in R_HOME/etc/Renviron.site
Local in either the current working directory or the home area
In this file you can specify R_LIBS and the R_LIBS_SITE environment variables.
For your particular problem, you probably want to add the NFS drive location to R_LIBS_SITE in the R_HOME/etc/Renviron.site file.
## To get R_HOME
Sys.getenv("R_HOME")
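As a concrete sketch (the NFS mount point below is just a placeholder for your setup), the Renviron.site entry and a quick check from R might look like this:
# In R_HOME/etc/Renviron.site:
R_LIBS_SITE=/nfs/shared/R/site-library
# Back in R, after a restart:
Sys.getenv("R_LIBS_SITE")
.libPaths()   # the NFS location should now appear alongside the default library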

RStudio: using different package versions for each .Rproj

I have a few older R projects I'm working with, which depend on several currently deprecated (or heavily modified) packages. In order for everything to work smoothly, I use older versions of those packages, which I have saved in another folder and copy manually into %userprofile%\documents\R\win-library\3.3 when necessary. However, this is not convenient, especially if I want to run multiple projects simultaneously, some of which require the new and updated versions of the packages.
My question: is there a way to specify, for each .Rproj, a custom directory from which it would take and load the libraries?
You can solve this much more simply:
Have a top-level directory for each project, call them projA, projB, ...
Within each of these, create a directory libs/, say.
And within each of these directories have a file .Rprofile with a single assignment such as .libPaths("./libs")
Now when you start R in the different project directories, each will have a separate library directory preceding the path, allowing you to place per-project overrides there.
In a nutshell, the approach outlined here allows you to keep the local and modified packages around as you please. (You can even assign common directories via .libPaths() if you so choose.)
The nice thing is that this approach:
works with any R invocation, batch or GUI or RStudio or shiny or ...
does not depend on any other tools, and hence
does not rely on RStudio or .Rproj files -- though you are free to use RStudio as well.
As so often, Base R is there for you.
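A minimal sketch of that layout, assuming two projects named projA and projB (the names are just examples):
# projA/.Rprofile and projB/.Rprofile each contain the single line:
.libPaths("./libs")
# Starting R inside projA/ then puts projA/libs first on the search path:
.libPaths()                     # the expanded ./libs path appears before the usual libraries
install.packages("somePkg")     # placeholder name; installs into the per-project library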
One option is to use the checkpoint package by Revolution Analytics.
You can indicate, for each main R file in a project, the date for which you wish to load a set of packages. You can read a bit more about it here.
To pull snapshotted packages from a given date from the mirror use getValidSnapshots(mranRootUrl = mranUrl()).
To create a checkpoint:
# Create temporary project and set working directory
example_project <- paste0("~/checkpoint_example_project_", Sys.Date())
dir.create(example_project, recursive = TRUE)
oldwd <- setwd(example_project)
# Write dummy code file to project
cat("library(MASS)", "library(foreach)",
sep="\n",
file="checkpoint_example_code.R")
# Create a checkpoint by specifying a snapshot date
library(checkpoint)
checkpoint("2014-09-17")
# Check that CRAN mirror is set to MRAN snapshot
getOption("repos")
# Check that library path is set to ~/.checkpoint
.libPaths()
# Check which packages are installed in checkpoint library
installed.packages()
# cleanup
unlink(example_project, recursive = TRUE)
setwd(oldwd)

SparkR cannot access to file in workers

This question is basically the same as this one (but in R).
I am developing an R package that uses SparkR. I have created some unit tests (several .R files) in PkgName/inst/tests/testthat using the testthat package. For one of the tests I need to read an external data file, and since it is small and is only used in the tests, I read that it can be placed in the same folder as the tests.
When I deploy this with Maven on a standalone Spark cluster, using "local[*]" as master, it works. However, if I try using a "remote" Spark cluster (via Docker; the image has Java, Spark 1.5.2 and R, where I create a master at, e.g., http://172.17.0.1 and then a worker that is successfully linked to that master), then it does not work. It complains that the data file cannot be found, because it seems to look for it using an absolute path that is valid only on my local PC but not in the workers. The same happens if I use only the filename (without the preceding path).
I have also tried delivering the file to the workers with the --files argument to spark-submit, and the file is successfully delivered (apparently it is placed at http://192.168.0.160:44977/files/myfile.dat, although the port changes with every execution). If I try to retrieve the file's location using SparkFiles.get, I get a path (which has some random number in one of the intermediate folders), but apparently it still refers to a path on my local machine. If I try to read the file using the path I've retrieved, it throws the same error (file not found).
I have set the environment variables like this:
SPARK_PACKAGES = "com.databricks:spark-csv_2.10:1.3.0"
SPARKR_SUBMIT_ARGS = " --jars /path/to/extra/jars --files /local/path/to/mydata.dat sparkr-shell"
SPARK_MASTER_IP = "spark://172.17.0.1:7077"
The INFO messages say:
INFO Utils: Copying /local/path/to/mydata.dat to /tmp/spark-c86739c6-2c73-468f-8326-f3d03f5abd6b/userFiles-e1e77e47-2689-4882-b60f-327cf99fe5e0/mydata.dat
INFO SparkContext: Added file file:/local/path/to/mydata.dat at http://192.168.0.160:47110/files/mydata.dat with timestamp 1458995156845
This port changes from one run to another. From within R, I have tried:
fullpath <- SparkR:::callJStatic("org.apache.spark.SparkFiles", "get", "mydata.dat")
(here I use callJStatic only for debugging purposes) and I get
/tmp/spark-c86739c6-2c73-468f-8326-f3d03f5abd6b/userFiles-e1e77e47-2689-4882-b60f-327cf99fe5e0/ratings.dat
but when I try to read from fullpath in R, I get fileNotFound exception, probably because fullpath is not the location in the workers. When I try to read simply from "mydata.dat" (without a full path) I get the same error, because R is still trying to read from my local path where my project is placed (just appends "mydata.dat" to that local path).
I have also tried delivering my R package to the workers (not sure if this may help or not), following the correct packaging conventions (a JAR file with a strict structure and so on). I get no errors (seems the JAR file with my R package can be installed in the workers) but with no luck.
Could you help me please? Thanks.
EDIT: I think I was wrong and I don't need to access the file in the workers but just in the driver, because the access is not part of a distributed operation (it is just a call to SparkR::read.df). Anyway, it does not work. But surprisingly, if I pass the file with --files and read it with read.table (from the base R utils, not from SparkR), passing the full path returned by SparkFiles.get, it works (although this is useless to me). By the way, I'm using SparkR version 1.5.2.
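For reference, a minimal sketch of the workaround described in the EDIT, assuming SparkR 1.5.2, the SPARKR_SUBMIT_ARGS shown above, and that mydata.dat was shipped with --files:
library(SparkR)
sc <- sparkR.init(master = "spark://172.17.0.1:7077")
# Same internal call as above to resolve where spark-submit copied the file on the driver
# (not a public SparkR API in 1.5.x):
fullpath <- SparkR:::callJStatic("org.apache.spark.SparkFiles", "get", "mydata.dat")
# Base R can read from the resolved path on the driver, even though
# SparkR::read.df on the same path failed:
mydata <- read.table(fullpath)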

How do I change the default library path for R packages

I have attempted to install R and RStudio on the local drive of my work computer, as opposed to the organization's network folder, because anything that runs through the network is really slow. When installing, the destination path shows that it's my local C: drive. However, when I install a new package, the default path shown is my network drive and there is no option to change it:
.libPaths()
[1] "\\\\The library/path/I/don't/want"
[2] "C:/Program Files/R/R-3.2.1/library"
I'm running Windows 7 Professional. How can I remove library path [1] and make path [2] the primary location for all base packages and all new packages that I install?
Windows 7/10: If your C:\Program Files (or wherever R is installed) is blocked for writing, as mine is, then you'll get frustrated editing Rprofile.site (as I did). As specified in the accepted answer, I updated R_LIBS_USER and it worked. However, even after reading the fine manual several times and extensive searching, it took me several hours to do this. In the spirit of saving someone else time...
Let's assume you want your packages to reside in C:\R\Library:
Create the folder C:\R\Library. Next, add this folder to the R_LIBS_USER path:
Click Start --> Control Panel --> User Accounts --> Change my environment variables
The Environment Variables window pops up. If you see R_LIBS_USER, highlight it and click Edit. Otherwise click New. Both actions open a window with fields for Variable and Value.
In my case, R_LIBS_USER was already there, and Value was a path to my desktop. I added the folder that I created to the path, separated by a semicolon: C:\R\Library;C:\Users\Eric.Krantz\Desktop\R stuff\Packages.
(NOTE: In the last step, I could have removed the path to the Desktop location and simply left C:\R\Library).
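After restarting R, a quick check (using the paths from the example above) should confirm the new location is being picked up:
Sys.getenv("R_LIBS_USER")   # should now contain C:\R\Library
.libPaths()                 # "C:/R/Library" should appear near the top of the list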
See help(Startup) and help(.libPaths) as you have several possibilities where this may have gotten set. Among them are
setting R_LIBS_USER
assigning .libPaths() in .Rprofile or Rprofile.site
and more.
In this particular case you need to go backwards and unset wherever \\\\The library/path/I/don't/want is set.
To otherwise ignore it, you need to override its use explicitly, i.e. via
library("somePackage", lib.loc=.libPaths()[-1])
when loading a package.
Facing the very same problem (avoiding the default path on a network drive), I came up with this solution using the hints given in other answers.
The solution is to edit the Rprofile file to override the variable R_LIBS_USER, which by default points to the home directory.
Here are the steps:
Create the target destination folder for the libraries, e.g., ~/target.
Find the Rprofile file. In my case it was at C:\Program Files\R\R-3.3.3\library\base\R\Rprofile.
Edit the file and change the definition of the variable R_LIBS_USER. In my case, I replaced the line file.path(Sys.getenv("R_USER"), "R", with file.path("~/target", "R",.
The documentation that supports this solution is here.
Original file with:
if(!nzchar(Sys.getenv("R_LIBS_USER")))
    Sys.setenv(R_LIBS_USER=
               file.path(Sys.getenv("R_USER"), "R",
                         "win-library",
                         paste(R.version$major,
                               sub("\\..*$", "", R.version$minor),
                               sep=".")
                         ))
Modified file:
if(!nzchar(Sys.getenv("R_LIBS_USER")))
    Sys.setenv(R_LIBS_USER=
               file.path("~/target", "R",
                         "win-library",
                         paste(R.version$major,
                               sub("\\..*$", "", R.version$minor),
                               sep=".")
                         ))
Windows 10 on a Network
Having your packages stored on the network drive can slow down the performance of R / RStudio considerably, and you spend a lot of time waiting for libraries to load or install, due to the bottleneck of having to push and retrieve data over the server to and from your local host. See the following instructions for creating an .Rprofile on your local machine:
Create a directory called C:\Users\xxxxxx\Documents\R\3.4 (or whatever R version you are using); this is where you will store your local R packages (your directory location may be different from mine)
On the R console, type Sys.getenv("HOME") to get your home directory (this is where your .Rprofile will be stored and where R will always look for it; this is on the network if packages are stored there)
Create a file called .Rprofile and place it in :\YOUR\HOME\DIRECTORY\ON_NETWORK (the directory you get after typing Sys.getenv("HOME") in the R console)
File contents of .Rprofile should be like this:
# Search two places for packages: install new packages into the first directory and load built-in packages from the second (the second path comes from your base R install and will be different for some setups)
.libPaths(c("C:/Users/xxxxxx/Documents/R/3.4", "C:/Program Files/Microsoft/R Client/R_SERVER/library"))
message("*** Setting libPath to local hard drive ***")
#insert a sleep command at line 12 of the unpackPkgZip function. So, just after the package is unzipped.
trace(utils:::unpackPkgZip, quote(Sys.sleep(2)), at=12L, print=TRUE)
message("*** Add 2 second delay when installing packages, to accommodate virus scanner for R 3.4 (fixed in R 3.5+)***")
# fix problem with tcltk for sqldf package: https://github.com/ggrothendieck/sqldf#problem-involvling-tcltk
options(gsubfn.engine = "R")
message("*** Successfully loaded .Rprofile ***")
Restart RStudio and verify that the messages above are displayed.
Now you can enjoy faster performance of your application on the local host, versus storing the packages on the network and slowing everything down.
I was struggling for a while with this as my work computer (with Windows 10) created the default user library on a network drive, which would slow down R and RStudio to an unusable state.
In case this helps someone, this is the easiest way I found, without requiring admin rights:
make sure the directory you want to install your packages into exists. If you want to respect the convention, use: C:\Users\username\R\win-library\rversion (for example, something like: C:\Users\janebloggs\R\win-library\3.6)
create a .Renviron file in your home directory (which might be on the network drive?), and in it, write one single line that defines the R_LIBS_USER variable to be your custom path:
R_LIBS_USER=C:\Users\janebloggs\R\win-library\3.6
(feel free to add comments too, with lines starting with #)
If a .Renviron file exists, R will read it at startup and use the variables as they are defined in there, before running the code in the .Rprofile. You can read about it in help(Startup).
Now it should be persistent between sessions!
After a couple of hours of trying to solve the issue in several ways, some of which are described here, for me (on Windows 10) the option of creating a Renviron file worked, but a little differently from what was written above.
The task is to change the value of the variable R_LIBS_USER. To do this, two steps are needed:
Create a file named Renviron (without a dot) in the etc folder of the directory where R is installed (for example, for me it was C:\Program Files\R\R-4.0.0\etc)
Insert a line in Renviron with the new path: R_LIBS_USER = "C:/R/Library"
After that, restart R and use .libPaths() to confirm that the default directory changed.
I think I tried all of the above and it didn't work for me. This worked, though:
In your home directory, make a file called ".Rprofile" (it needs to be .Rprofile rather than .Renviron, because .libPaths() is R code, not an environment-variable setting)
In that file, write:
.libPaths(new = "/my/path/to/libs")
Save and restart R if you had it open

What type of object is an R package?

Probably a pretty basic question, but a friend and I tried to run str(package_name) and R threw us an error. Now that I'm looking at it, I'm wondering if an R package is like a .zip file in that it is a collection of objects, say pictures and songs, but not a picture or song itself.
If I tried to open a zip of pictures with an image viewer, it wouldn't know what to do until I unzipped it - just like I can't call str(forecast) but I can call str(ts) once I've loaded the forecast package into my library...
Can anyone set me straight?
R packages are generally distributed as compressed bundles of files. They can either be in "binary" form, preprocessed at a repository so that any C or Fortran source is already compiled and the proper headers are created, or in source form, where the various required files are available to be used in the installation process; the latter requires that users have the necessary compilers and tools installed at locations where the R build process, using OS system resources, can get at them.
If you read the documentation for a package at CRAN you see they are distributed in set of compressed formats that vary depending on the OS-targets:
Package source: Rcpp_0.11.3.tar.gz # the Linux/UNIX targets
Windows binaries: r-devel: Rcpp_0.11.3.zip, r-release: Rcpp_0.11.3.zip, r-oldrel: Rcpp_0.11.3.zip
OS X Snow Leopard binaries: r-release: Rcpp_0.11.3.tgz, r-oldrel: Rcpp_0.11.3.tgz
OS X Mavericks binaries: r-release: Rcpp_0.11.3.tgz
Old sources: Rcpp archive # not really a file but a web link
Once installed an R package will have a specified directory structure. The DESCRIPTION file is a text file with specific entries for components that determine whether the local installation meets the dependencies of the package. There are NAMESPACE, LICENSE, and INDEX files. There are directories named '/help', '/html', '/Meta', '/R', and possibly '/libs', '/demo', '/data', '/unitTests', and others.
This is the tree at the top of the ../library/Rcpp package directory:
$ ls
CITATION NAMESPACE THANKS examples libs
DESCRIPTION NEWS.Rd announce help prompt
INDEX R discovery html skeleton
Meta README doc include unitTests
So in the "life-cycle" of a package, there will be initially a series of required and optional files, which then get processed by the BUILD and CHECK mechanisms into an installed package, which than then get compressed for distribution, and later unpacked into a specified directory tree on the users machine. See these help pages:
?.libPaths # also describes .Library()
?package.skeleton
?install.packages
?INSTALL
And of course read Writing R Extensions, a document that ships with every installation of R.
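To poke around the installed directory tree described above from within R, something like this should work on any installation (it uses a base package, so no extra installs are needed):
pkg_dir <- find.package("stats")
list.files(pkg_dir)                                  # DESCRIPTION, INDEX, Meta, R, help, html, libs, ...
readLines(file.path(pkg_dir, "DESCRIPTION"), n = 5)  # the metadata R checks at install/load time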
Your question is:
What type of object is an R package?
Somehow, I’m still missing an answer to this exact question. So here goes:
As far as R is concerned, an R package is not an object. That is, it’s not an object in R’s type system. R is being a bit difficult, because it allows you to write
library(pkg_name)
without requiring you to define pkg_name anywhere beforehand. In contrast, other objects which you use in R have to be defined somewhere, either by you or by some package that's loaded either explicitly or implicitly.
This is unfortunate, and confuses people. Therefore, when you see library(pkg_name), think
library('pkg_name')
That is, imagine the package name in quotes. This does in fact work just as expected. The fact that the code also works without quotes is a peculiarity of the library function, known as non-standard evaluation. In this case, it’s mostly an unfortunate design decision (but there are reasons).
So, to repeat the answer: a package isn't a type of R object [1]. For R, it's simply a name which refers to a known location in the file system, similar to what you've assumed. BondedDust's answer goes into detail to explain that structure, so I shan't repeat it here.
[1] For super technical details, see Joshua's and Richard's comments below.
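A short illustration of the point, using a package that ships with R:
library("stats")                    # quoted form: standard evaluation
library(stats)                      # unquoted form: works via non-standard evaluation
pkg <- "stats"
library(pkg, character.only = TRUE) # needed when the package name is held in a variable
find.package("stats")               # the name maps to a directory on disk, not to an R object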
From R's own documentation:
Packages provide a mechanism for loading optional code, data and documentation as needed. … A package is a directory of files which extend R, a source package (the master files of a package), or a tarball containing the files of a source package, or an installed package, the result of running R CMD INSTALL on a source package. On some platforms (notably OS X and Windows) there are also binary packages, a zip file or tarball containing the files of an installed package which can be unpacked rather than installing from sources. A package is not a library.
So yes, a package is not the functions within it; it is a mechanism to have R be able to use the functions or data which comprise the package. Thus, it needs to be loaded first.
I am reading Hadley's book Advanced R (Chapter 6.3, Functions, p. 79), and this quote should cover it, I think:
Every operation is a function call
“To understand computations in R, two slogans are helpful:
Everything that exists is an object.
Everything that happens is a function call."
— John Chambers
According to that, using library(name_of_library) is a function call that loads the package. Everything that gets loaded, i.e. the functions and data sets, are objects which you can use by calling other functions. In that sense a package is not an object in any of R's environments until it is loaded; then you can say that it is a collection of the objects it contains, which are now loaded.
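A quick way to see this distinction in practice, again with a package that ships with R:
is.function(library)        # TRUE: library() itself is just a function being called
library(stats)              # the call attaches the package
head(ls("package:stats"))   # the functions and data it attached are objects
exists("stats")             # FALSE: the package name itself is not an object
str(ts)                     # but str() works on the objects a loaded package provides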
