This question is basically the same as this one (but in R).
I am developing an R package that uses SparkR. I have created some unit tests (several .R files) in PkgName/inst/tests/testthat using the testthat package. For one of the tests I need to read an external data file, and since it is small and is only used in the tests, I read that it can simply be placed in the same folder as the tests.
When I deploy this with Maven on a standalone Spark cluster, using "local[*]" as master, it works. However, if I use a "remote" Spark cluster (via Docker; the image has Java, Spark 1.5.2 and R, and I create a master at, e.g., http://172.17.0.1 and then a worker that is successfully linked to that master), it does not work. It complains that the data file cannot be found, apparently because it looks for it using an absolute path that is valid only on my local PC but not on the workers. The same happens if I use only the filename (without a preceding path).
I have also tried delivering the file to the workers with the --files argument to spark-submit, and the file is successfully delivered (apparently it is placed at http://192.168.0.160:44977/files/myfile.dat, although the port changes with every execution). If I try to retrieve the file's location using SparkFiles.get, I get a path (with some random number in one of the intermediate folders), but apparently it still refers to a path on my local machine. If I try to read the file using the path I've retrieved, it throws the same error (file not found).
I have set the environment variables like this:
SPARK_PACKAGES = "com.databricks:spark-csv_2.10:1.3.0"
SPARKR_SUBMIT_ARGS = " --jars /path/to/extra/jars --files /local/path/to/mydata.dat sparkr-shell"
SPARK_MASTER_IP = "spark://172.17.0.1:7077"
The INFO messages say:
INFO Utils: Copying /local/path/to/mydata.dat to /tmp/spark-c86739c6-2c73-468f-8326-f3d03f5abd6b/userFiles-e1e77e47-2689-4882-b60f-327cf99fe5e0/mydata.dat
INFO SparkContext: Added file file:/local/path/to/mydata.dat at http://192.168.0.160:47110/files/mydata.dat with timestamp 1458995156845
This port changes from one run to another. From within R, I have tried:
fullpath <- SparkR:::callJStatic("org.apache.spark.SparkFiles", "get", "mydata.dat")
(here I use callJStatic only for debugging purposes) and I get
/tmp/spark-c86739c6-2c73-468f-8326-f3d03f5abd6b/userFiles-e1e77e47-2689-4882-b60f-327cf99fe5e0/mydata.dat
but when I try to read from fullpath in R, I get a fileNotFound exception, probably because fullpath is not the location on the workers. When I try to read simply from "mydata.dat" (without a full path) I get the same error, because R is still trying to read from the local path where my project is placed (it just appends "mydata.dat" to that local path).
I have also tried delivering my R package to the workers (not sure if this may help or not), following the correct packaging conventions (a JAR file with a strict structure and so on). I get no errors (seems the JAR file with my R package can be installed in the workers) but with no luck.
Could you help me please? Thanks.
EDIT: I think I was wrong and I don't need to access the file in the workers but just in the driver, because the access is not part of a distributed operation (just calling SparkR::read.df). Anyway it does not work. But surprisingly, if I pass the file with --files and I read it with read.table (not from SparkR but the basic R utils) passing the full path returned by SparkFiles.get, it works (although this is useless to me). Btw I'm using SparkR version 1.5.2.
After having set the path for the default working directory as well as my first (and only) project within RStudio options I wonder why RStudio keeps creating an empty folder named "R" within my "/home" directory every time it is started.
Is there any file I could delete/edit (eventually create) to stop this annoying behaviour and if so, where is it located ?
System: Linux Mint v. 19.3
Software: RStudio v. 1.3.959 / R version 3.4.4
Thanks in advance for any hints.
Yes, you can prevent the creation of the R directory — R is configurable via a set of environment variables.
However, setting these correctly isn’t trivial. The first issue is that many R packages are sensitive to the R version they’re installed with. If you upgrade R and try to load the existing package, it may break. Therefore, the R package library path should be specific to the R version.
On clusters, an additional issue is that the same library path might be read by various cluster nodes that run on different architectures; this is rare, but it happens. In such cases, compiled R packages might need to be different depending on the architecture.
Consequently, in general the R library path needs to be specific both to the R version and the system architecture.
Next, even if you configure an alternative path R will silently ignore it if it doesn’t exist. So be sure to manually create the directory that you’ve configured.
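A minimal R sketch of that behaviour, assuming a throwaway directory under the session's temporary folder (the "demo-library-" prefix is made up for illustration): .libPaths() drops a path that does not exist yet and keeps it once the directory has been created.

```r
# Hypothetical library location used only for this demonstration.
lib <- tempfile("demo-library-")
old <- .libPaths()                      # remember the current setting

# The directory does not exist yet, so R drops it without any warning.
.libPaths(c(lib, old))
found_before <- normalizePath(lib, winslash = "/", mustWork = FALSE) %in% .libPaths()

# Create the directory first, then configure it again.
dir.create(lib, recursive = TRUE)
.libPaths(c(lib, old))
found_after <- normalizePath(lib, winslash = "/") %in% .libPaths()

.libPaths(old)                          # restore the previous library paths
```

Here found_before should come out FALSE and found_after TRUE, which is why the directory has to exist before R starts.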
Lastly, where to put this configuration? One option would be to put it into the user environment file, the path of which can be specified with the environment variable R_ENVIRON_USER — it defaults to $HOME/.Renviron. This isn’t ideal though, because it means the user can’t temporarily override this setting when calling R: variables in this file override the calling environment.
Instead, I recommend setting this in the user profile (e.g. $HOME/.profile). However, when you use a desktop launcher to launch RStudio, this file won’t be read, so be sure to edit your *.desktop file accordingly (see footnote 1).
So in sum, add the following to your $HOME/.profile:
export R_LIBS_USER=${XDG_DATA_HOME:-$HOME/.local/share}/R/%p-library/%v
And make sure this directory exists: re-source ~/.profile (launching a new shell inside the current one is not enough), and execute
mkdir -p "$(Rscript -e 'cat(Sys.getenv("R_LIBS_USER"))')"
The above uses the XDG base dir specification, which is the de-facto standard on Linux systems (see footnote 2). The path uses the placeholders %p and %v. R will fill these in with the system platform and the R version (in the form major.minor), respectively.
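To see roughly what the placeholders turn into on a given machine, the following R sketch rebuilds the path by hand. It only approximates the substitution (R performs it internally), and the variable names are made up for illustration.

```r
# Approximate what R substitutes for the placeholders in R_LIBS_USER.
platform <- R.version$platform                      # stands in for %p
version  <- paste(R.version$major,                  # stands in for %v (major.minor)
                  sub("\\..*$", "", R.version$minor),
                  sep = ".")

# Same fallback as ${XDG_DATA_HOME:-$HOME/.local/share} in the export line.
base <- Sys.getenv("XDG_DATA_HOME")
if (!nzchar(base)) base <- file.path(path.expand("~"), ".local", "share")

expanded <- file.path(base, "R", paste0(platform, "-library"), version)
expanded
```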
If you want to use a custom R configuration file (“user profile”) and/or R environment file, I suggest setting their location in the same way, by configuring R_PROFILE_USER and R_ENVIRON_USER (since their default location, once again, is in the user home directory):
export R_PROFILE_USER=${XDG_CONFIG_HOME:-$HOME/.config}/R/rprofile
export R_ENVIRON_USER=${XDG_CONFIG_HOME:-$HOME/.config}/R/renviron
1 I don’t have a Linux desktop system but I believe that editing the Exec entry to the following should do it:
Exec=env R_LIBS_USER=${XDG_DATA_HOME:-$HOME/.local/share}/R/%p-library/%v /path/to/rstudio
2 Other systems require different handling. On macOS, the canonical setting for the library location would be $HOME/Library/Application Support/R/library/%v. However, setting environment variables on macOS for GUI applications is frustratingly complicated.
On Windows, the canonical location is %LOCALAPPDATA%/R/library/%v. To set this variable, use [Environment]::SetEnvironmentVariable in PowerShell or, when using cmd.exe, use setx.
I tried to run a Python script from R with:
system('python script.py arg1 arg2')
And got an error:
ImportError: No module named pandas
This was a bit of a surprise since the script was working from the terminal as expected. Having encountered this type of issue before (with knitr, whence the engine.path chunk option), I know to check:
Sys.which('python')
# python
# "/usr/bin/python"
And compare it to the command line:
$ which python
# /Users/michael.chirico/anaconda2/bin/python
(i.e., the error arises because I have pandas installed for the anaconda distribution, though TBH I don't know why I have a different distribution)
Hence I can fix my issue by running:
system('/Users/michael.chirico/anaconda2/bin/python script.py arg1 arg2')
My question is two-fold:
How does R's system/Sys.which find a different python than my terminal?
How can I fix this besides writing out the full binary path each time?
I read ?Sys.which for some hints, but to no avail. In particular, ?Sys.which suggests Sys.which is using which:
This is an interface to the system command which
This is clearly (?) untrue; to be sure, I checked Sys.which('which') and which which to confirm both are pointing to /usr/bin/which (goaded on by this tidbit):
On a Unix-alike the full path to which (usually /usr/bin/which) is found when R is installed.
To the latter, on a whim I tried Sys.setenv(python = '/Users/michael.chirico/anaconda2/bin/python') to no avail.
As some of the comments hint, this is a problem that arises because the PATH environment variable is different for programs launched by Finder (or the Dock) than it is in the Terminal. There are ways to set the PATH for Dock-launched applications, but they aren't pretty. Here's a place to start looking if you want to go that route:
https://apple.stackexchange.com/questions/51677/how-to-set-path-for-finder-launched-applications
The other thing you can do, which is probably more straightforward, is tell R to set the PATH variable when it starts up, using Sys.setenv to add the path to your desired Python instance. You can do that for just one project, for your whole user account, or for the whole system, by placing the command in a .Rprofile file in the corresponding location. More information on how to do this here:
https://stat.ethz.ch/R-manual/R-devel/library/base/html/Startup.html
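As a rough sketch of the Sys.setenv approach, the helper below (prepend_path is a made-up name, not part of R) puts a directory at the front of PATH; the same lines could go into a .Rprofile. The directory is the one reported by `which python` in the question and is only an assumption for any other machine.

```r
# Put `dir` at the front of PATH for this R session (and for child
# processes started with system()), unless it is already listed.
prepend_path <- function(dir) {
  sep <- .Platform$path.sep
  parts <- strsplit(Sys.getenv("PATH"), sep, fixed = TRUE)[[1]]
  if (!dir %in% parts) {
    Sys.setenv(PATH = paste(c(dir, parts), collapse = sep))
  }
  invisible(Sys.getenv("PATH"))
}

# Directory taken from the `which python` output in the question.
prepend_path("/Users/michael.chirico/anaconda2/bin")
Sys.which("python")  # should now resolve inside that directory, if python exists there
```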
I work in an environment where linking of dynamic libraries is restricted to certain locations. When I use RStudio and request a new C++ file I get the "Hello World" template. When I try to compile and link it by clicking on "Source" in RStudio, I get an error:
LoadLibrary failure: Access is denied.
This error occurs because the library was placed in a location from which DLL files are not allowed to be loaded. To work around this limitation, I would like to know how to tell RCpp to place the temporary DLLs (not in a package) in a specific location.
I know that Dirk has suggested that this is not in the scope of RCpp and that all code should live in packages, but that will not be the most user-friendly environment for the users here. I suspect that most will use RStudio projects with Git.
So, that being said, is there an environment variable that I can mangle to get RCpp to place temporary DLL files in a specific place? Or is there some other mechanism which I can use to alter this?
Try setting TMPDIR which R respects. This is indeed not an Rcpp issue but a generic R CMD build / R CMD INSTALL issue.
From help(tempfile):
The environment variables TMPDIR, TMP and TEMP are checked in
turn and the first found which points to a writable directory is
used: if none succeeds /tmp is used.
PS Rcpp with lower-case C.
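The lookup order quoted from help(tempfile) can be sketched in R as follows. This is an illustrative re-implementation, not R's internal code, and pick_tmp_base is a made-up name.

```r
# Illustrative re-implementation of the documented lookup order; this is
# not R's internal code. Returns the first candidate that is an existing,
# writable directory, or "/tmp" if none qualifies.
pick_tmp_base <- function(candidates = Sys.getenv(c("TMPDIR", "TMP", "TEMP"))) {
  for (d in candidates) {
    if (nzchar(d) && dir.exists(d) && file.access(d, mode = 2) == 0) {
      return(d)
    }
  }
  "/tmp"
}

pick_tmp_base()   # the base directory a fresh R session would be expected to pick
tempdir()         # the per-session directory R actually created at startup
```

Because the per-session directory is chosen when R starts, TMPDIR has to be set before R (or RStudio) is launched; changing it with Sys.setenv in a running session does not move tempdir().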
I have attempted to install R and RStudio on the local drive of my work computer, as opposed to the organization's network folder, because anything that runs through the network is really slow. When installing, the destination path shows that it's my local C: drive. However, when I install a new package, the default path shown is my network drive and there is no option to change it:
.libPaths()
[1] "\\\\The library/path/I/don't/want"
[2] "C:/Program Files/R/R-3.2.1/library"
I'm running windows 7 professional. How can I remove library path [1] and make path [2] my primary for all base packages and all new packages that I install?
Windows 7/10: If your C:\Program Files (or wherever R is installed) is blocked for writing, as mine is, then you'll get frustrated editing Rprofile.site (as I did). As specified in the accepted answer, I updated R_LIBS_USER and it worked. However, even after reading the fine manual several times and extensive searching, it took me several hours to do this. In the spirit of saving someone else time...
Let's assume you want your packages to reside in C:\R\Library:
Create the folder C:\R\Library. Next I need to add this folder to the R_LIBS_USER path:
Click Start --> Control Panel --> User Accounts --> Change my environment variables
The Environment Variables window pops up. If you see R_LIBS_USER, highlight it and click Edit. Otherwise click New. Both actions open a window with fields for Variable and Value.
In my case, R_LIBS_USER was already there, and Value was a path to my desktop. I added to the path the folder that I created, separated by semicolon. C:\R\Library;C:\Users\Eric.Krantz\Desktop\R stuff\Packages.
(NOTE: In the last step, I could have removed the path to the Desktop location and simply left C:\R\Library).
See help(Startup) and help(.libPaths) as you have several possibilities where this may have gotten set. Among them are
setting R_LIBS_USER
assigning .libPaths() in .Rprofile or Rprofile.site
and more.
In this particular case you need to go backwards and unset wherever \\\\The library/path/I/don't/want is set.
To otherwise ignore it, you need to override it explicitly, i.e. via
library("somePackage", lib.loc=.libPaths()[-1])
when loading a package.
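If the position of the unwanted entry is not known in advance, one option is to filter by prefix instead of by index. The sketch below assumes the network library is the only entry given as a UNC path; drop_network_paths is a hypothetical helper, not an R function.

```r
# Keep only library paths that do not start with a UNC prefix
# (two leading backslashes or two leading forward slashes).
drop_network_paths <- function(paths) {
  paths[!grepl("^(\\\\\\\\|//)", paths)]
}

local_libs <- drop_network_paths(.libPaths())
local_libs
# library("somePackage", lib.loc = local_libs)   # load from local libraries only
```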
Facing the very same problem (avoiding the default path on a network), I came up with this solution using the hints given in other answers.
The solution is editing the Rprofile file to overwrite the variable R_LIBS_USER which by default points to the home directory.
Here the steps:
Create the target destination folder for the libraries, e.g.,
~\target.
Find the Rprofile file. In my case it was at C:\Program Files\R\R-3.3.3\library\base\R\Rprofile.
Edit the file and change the definition of the variable R_LIBS_USER. In my case, I replaced the line file.path(Sys.getenv("R_USER"), "R", with file.path("~/target", "R",.
The documentation that supports this solution is here
Original file with:
if(!nzchar(Sys.getenv("R_LIBS_USER")))
    Sys.setenv(R_LIBS_USER=
               file.path(Sys.getenv("R_USER"), "R",
                         "win-library",
                         paste(R.version$major,
                               sub("\\..*$", "", R.version$minor),
                               sep=".")
                         ))
Modified file:
if(!nzchar(Sys.getenv("R_LIBS_USER")))
    Sys.setenv(R_LIBS_USER=
               # use a forward slash: in an R string, "\t" would be read as a tab
               file.path("~/target", "R",
                         "win-library",
                         paste(R.version$major,
                               sub("\\..*$", "", R.version$minor),
                               sep=".")
                         ))
Windows 10 on a Network
Having your packages stored on the network drive can slow down R / R Studio considerably, and you spend a lot of time waiting for libraries to load or install, because the data has to travel between the server and your local host. See the following for instructions on how to create an .Rprofile that points R at your local machine:
Create a directory called C:\Users\xxxxxx\Documents\R\3.4 (or whatever R version you are using, and where you will store your local R packages- your directory location may be different than mine)
On R Console, type Sys.getenv("HOME") to get your home directory (this is where your .Rprofile will be stored and where R will always check for packages; it is on the network if packages are stored there)
Create a file called .Rprofile and place it in :\YOUR\HOME\DIRECTORY\ON_NETWORK (the directory you get after typing Sys.getenv("HOME") in R Console)
File contents of .Rprofile should be like this:
#search 2 places for packages- install new packages to first directory- load built-in packages from the second (this is from your base R package- will be different for some)
.libPaths(c("C:/Users/xxxxxx/Documents/R/3.4", "C:/Program Files/Microsoft/R Client/R_SERVER/library"))
message("*** Setting libPath to local hard drive ***")
#insert a sleep command at line 12 of the unpackPkgZip function. So, just after the package is unzipped.
trace(utils:::unpackPkgZip, quote(Sys.sleep(2)), at=12L, print=TRUE)
message("*** Add 2 second delay when installing packages, to accommodate virus scanner for R 3.4 (fixed in R 3.5+)***")
# fix problem with tcltk for sqldf package: https://github.com/ggrothendieck/sqldf#problem-involvling-tcltk
options(gsubfn.engine = "R")
message("*** Successfully loaded .Rprofile ***")
Restart R Studio and verify that you see that the messages above are displayed.
Now you can enjoy faster performance of your application on local host, vs. storing the packages on the network and slowing everything down.
I was struggling for a while with this as my work computer (with Windows 10) created the default user library on a network drive, which would slow down R and RStudio to an unusable state.
In case this helps someone, this is the easiest way I found, without requiring admin rights:
make sure the directory you want to install your packages into exists. If you want to respect the convention, use: C:\Users\username\R\win-library\rversion (for example, something like: C:\Users\janebloggs\R\win-library\3.6)
create a .Renviron file in your home directory (which might be on the network drive?), and in it, write one single line that defines the R_LIBS_USER variable to be your custom path:
R_LIBS_USER=C:/Users/janebloggs/R/win-library/3.6
(feel free to add comments too, with lines starting with #)
If a .Renviron file exists, R will read it at startup and use the variables as they are defined in there, before running the code in the .Rprofile. You can read about it in help(Startup).
Now it should be persistent between sessions!
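After restarting R, a quick check along these lines shows whether the setting was picked up. It is only a sketch: check_user_lib is a made-up helper, and it assumes R_LIBS_USER holds a single path without %-placeholders.

```r
# Report what R picked up at startup. Assumes R_LIBS_USER holds a single
# path without %-placeholders.
check_user_lib <- function() {
  configured <- Sys.getenv("R_LIBS_USER")
  list(
    configured = configured,                      # value read from the environment
    exists     = nzchar(configured) && dir.exists(path.expand(configured)),
    first_lib  = .libPaths()[1]                   # where install.packages() writes by default
  )
}

str(check_user_lib())
```

If exists is FALSE, the directory named in .Renviron is missing, and R will not use it as a library.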
After a couple of hours of trying to solve the issue in several ways, some of which are described here, for me (on Win 10) the option of creating a Renviron file worked, but a little different from what was written here above.
The task is to change the value of the variable R_LIBS_USER. To do this, two steps are needed:
Create the file named Renviron (without dot) in the folder \Program\etc\ (Program is the directory where R is installed--for example, for me it was C:\Program Files\R\R-4.0.0\etc)
Insert a line in Renviron with new path: R_LIBS_USER = "C:/R/Library"
After that, reboot R and use .libPaths() to confirm the default directory changed.
I think I tried all of the above and it didn't work for me. This worked, though:
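In your home directory, make a file called ".Rprofile" (the line below is R code, which R runs from .Rprofile; .Renviron only accepts NAME=value lines)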
In that file, write:
.libPaths(new = "/my/path/to/libs")
Save and restart R if you had it open
I'm using a Windows 7 x64 machine with R-3.1.0. I installed the Rserve package through RStudio.
Rserve starts successfully with the following code in RStudio:
library(Rserve)
Rserve()
I got the following output:
Starting Rserve...
"C:\R\R-31~1.0\library\Rserve\libs\x64\Rserve.exe"
My problem is that I couldn't locate the configuration file. Apparently it can't be "/etc/Rserv.conf".
I did come across a webpage saying that the config file is Rserv.cfg in the working directory (unless changed at compile-time). But which working directory? I have checked the working directory of the current R project as well as the Rserve library directory, but it was not there...Could someone help me with this please? Thank you.
Rserve does not automatically come with a config file, you must make one. Best steps for doing so:
Navigate to the folder where Rserve.exe was just installed (C:\R\R-31~1.0\library\Rserve\libs\x64\R, based on the message you copied here)
Find Rserve.exe, Rserve_d.exe, and Rserve.dll there. Copy these files.
Navigate to where R.dll is on your computer. This is probably C:\Program Files\R\R-3.1.3\bin\x64, but may be different depending on where you installed R to.
Copy the 3 files mentioned above to this location.
Create a text file here named "Rserv.cfg" with the arguments you are looking for, such as port 6312 or library(mvoutlier). Yes, I know that this is different from the documentation, but if you start Rserve_d.exe you will see that this is the file it is looking for. I have not had success naming it anything else.
You can start Rserve by specifying the location of the config file. In R instead of just Rserve() try the following:
Rserve(args="--RS-conf C:\\folder\\Rserv.cfg")
If path is more complicated you need to massage it a little bit:
Rserve(args="--RS-conf C:\\PROGRA~1\\R\\R-215~1.2\\library\\Rserve\\Rserv.cfg")
Look in the $RHOME/bin directory
If you can't find it here is a different way to approach it:
Download Rserve at [http://rforge.net/snapshot/Rserve_.tar.gz], and save it in your desired directory
Run R CMD INSTALL Rserve_.tar.gz
This allows you to leave Rserve where you want it.
After looking at the Rserve source code and running some tests, I found that on Windows Rserve tries to load the configuration file from the current working directory. Also note that on Windows the file name is RServ.cfg and not Rserv.conf as documented.
The current working directory depends on the process; for example, in RStudio it is by default your Documents folder:
C:\Users\[username]\Documents
but can be changed in the "Global Options" of the IDE
So you can create an "RServ.cfg" text file in that directory with the options you need; starting Rserve in the usual way in RStudio
Rserve()
will load your configuration.
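Creating the file from R could look like the sketch below. write_rserve_cfg is a hypothetical helper, the file name follows the answers above, and "port 6312" is the example setting quoted there; the demonstration writes to tempdir() so that nothing in the real working directory is overwritten.

```r
# Write a minimal Rserve configuration file into `dir` and return its path.
# The file name follows the answers above; "port 6312" is the example
# setting quoted there.
write_rserve_cfg <- function(dir = getwd(), lines = "port 6312") {
  cfg <- file.path(dir, "Rserv.cfg")
  writeLines(lines, cfg)
  cfg
}

cfg <- write_rserve_cfg(tempdir())   # pass getwd() to target the working directory
readLines(cfg)
# Rserve::Rserve(args = paste("--RS-conf", cfg))   # explicit path, as in the earlier answer
```

Passing the path explicitly with --RS-conf avoids depending on whichever working directory the process happens to start in; paths containing spaces may need the short-name form shown in the earlier answer.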