I need to run (several times) my R script (script.R), which basically looks like this:
library(myLib)
cmd = commandArgs(TRUE)
args=myLib::parse.cmd(cmd)
myLib::exec(args)
myLib is my own package, which loads some dependencies (car, minpack.lm, plyr, ggplot2). The time required to load these libraries is comparable to the runtime of myLib::exec, so I'm looking for a way to avoid loading them on every call to Rscript script.R.
I know about Rserve, but it looks like a bit of overkill, though it could do exactly what I need. Are there any other solutions?
P.S.: I call script.R from the JVM using Scala.
Briefly:
on startup you need to load your libraries
if you start a new process for every call, you reload the libraries every time
you already mentioned a stateful solution (Rserve), which lets you start R once but connect and eval multiple times (see the sketch after this list)
so I think you answered your question.
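For what it's worth, a minimal sketch of the stateful route, reusing the myLib calls from the question (from the JVM you would talk to the same server through Rserve's Java client):
library(Rserve)
Rserve(args = "--no-save")    # start the server once; the loading cost is paid here

# from any later client session:
library(RSclient)
con <- RS.connect()                      # defaults to localhost:6311
RS.eval(con, library(myLib))             # load libraries once, server-side
RS.eval(con, myLib::exec(myLib::parse.cmd(c("arg1", "arg2"))))
RS.close(con)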
Otherwise, I enjoy littler and have shown how it starts faster than either R or Rscript -- but the fastest approach is simply not to restart.
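For example, a littler version of the script from the question might look like this (a sketch; littler exposes a ready-made argv vector instead of commandArgs()):
#!/usr/bin/env r
# 'r' from littler starts faster than R or Rscript
library(myLib)
args <- myLib::parse.cmd(argv)   # argv is provided by littler
myLib::exec(args)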
I tried littler; it seems amazing, but it doesn't seem to work on R v4.0.
Rserve seems cool, but like you pointed out it seems to be overkill.
I ended up limiting the import to just the functions I need.
For example:
library(dplyr, include.only = c("select", "mutate", "group_by", "summarise", "filter", "%>%", "row_number", "left_join", "rename"))
Related
I've looked around and I'm still not sure what the difference is between library()/require() and source() in R. According to this SO question: What is the difference between require() and library()? it looks like library() and require() are the same thing, except maybe one is legacy. Is source() for lazy developers that don't want to create a library? When do you use each of these constructs?
The differences between library and require are already well documented in What is the difference between require() and library()?.
So, I will focus on how source differs from them. In fact, they are fundamentally quite different commands. Neither library nor require actually executes any code; they simply attach a namespace, in a lazy fashion, meaning that individual functions in the package are not run unless they are actually called later. source, on the other hand, does something quite different: it executes all of the code in the file at that time.
A small caveat: packages can be made to actually run some code at the time of package loading or attaching, via the .onLoad and .onAttach functions. Have a look here: https://stat.ethz.ch/R-manual/R-devel/library/base/html/ns-hooks.html
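For instance, a minimal sketch of such a hook, as it might appear in a package's R/zzz.R:
.onAttach <- function(libname, pkgname) {
  # runs once, at the moment the package is attached via library()/require()
  packageStartupMessage("Welcome to ", pkgname)
}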
source runs the code in a .R file, line by line.
library and require load and attach R packages.
Is source() for lazy developers that don't want to create a library?
You're correct that source is for the cases when you don't have a package. Laziness is not the only reason; sometimes packages are simply not appropriate: packages provide functionality, but they don't do things. Perhaps I have a script that pulls data from a database, fits a model, and makes some predictions. A package may provide functions to help me do that, but it does not actually do it. A script saved in a .R file and run with source() can run the commands and complete the task.
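To make the contrast concrete (analysis.R is a hypothetical file of top-level commands):
source("analysis.R")   # executes every line of the file right now
library(dplyr)         # merely attaches the namespace; no user code runs yet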
I do want to address this:
it looks like library() and require() are the same thing, maybe one is legacy.
They both do the same thing (load and attach a package). The main difference is that library() will throw an error and stop the script if the package is not available, whereas require() will return TRUE or FALSE depending on its success. The general consensus is that library is better, so that your script stops with a nice clear error and you can install the missing package before proceeding. The linked question has a more thorough discussion which I won't try to replicate here.
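That said, require()'s return value enables a common install-if-missing pattern (a sketch, with ggplot2 purely as an example):
if (!require("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")   # only runs if the package is missing
  library(ggplot2)
}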
As my code evolves from version to version, I'm aware that there are some packages for which I've since found better/more appropriate alternatives, or whose purpose was limited to a section of code that I've now phased out.
Is there any easy way to tell which of the loaded packages are actually used in a given script? My header is beginning to get cluttered.
Update 2020-04-13
I've now updated the referenced function to use the abstract syntax tree (AST) instead of using regular expressions as before. This is a much more robust way of approaching the problem (it's still not completely ironclad). This is available from version 0.2.0 of funchir, now on CRAN.
I've just got around to writing a quick-and-dirty function to handle this which I call stale_package_check, and I've added it to my package (funchir).
e.g., if we save the following script as test.R:
library(data.table)
library(iotools)
DT = data.table(a = 1:3)
Then (from the directory containing that script) running funchir::stale_package_check('test.R') gives:
Functions matched from package data.table: data.table
**No exported functions matched from iotools**
Have you considered using packrat?
packrat::clean() would remove unused packages, for example.
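For instance, a sketch of the packrat route (it operates per project):
packrat::init()    # snapshot the packages this project actually uses
packrat::clean()   # remove installed packages not referenced by the project's code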
I've written a command-line script to accomplish this task. You can find it in this GitHub gist. I'm sure there are edge cases it misses, but it works pretty well on both R scripts and Rmd files.
My approach is always to close my R script or IDE (i.e. RStudio) and then start it again.
After this I run my function without loading any dependencies/packages beforehand.
This should result in various warning and error messages telling you which functions couldn't be found and executed. These in turn give you hints about which packages are necessary to load beforehand and which ones you can leave out.
I have a series of R scripts for doing the multiple steps of data analysis that I require. Some of these take a very long time and create really large objects. I've noticed that if I just source all of them in a row (via a main.R script), the processing for later steps takes much longer than if I source one script, save what I need, and restart R for the next step (loading the data I need).
I was wondering if there was a way, via Rscript or a Bash script perhaps, that I could carry this out. There would need to be objects that persist for the first 2 scripts (which load my external data and create the objects that will be used for all further steps). I suppose I could also just save those and load them in further scripts.
(I would also like to pass a number of named arguments to this script; I think I can find how on other SO posts, and can use something like optparse for that.)
So, the script would look something like this, I think:
#!/bin/bash
Rscript 01_load.R                  # objects would persist, ideally
Rscript 02_create_graphs.R         # objects would persist, ideally
Rscript 03_random_graphs.R         # contains code to save objects
# (R exits here)
Rscript 04_permutation_analysis.R  # would have to contain code to load data
# (R exits here)
And so on. Is there a solution to this? I'm using R 3.2.2 on 64-bit CentOS 6. Thanks.
Chris,
it sounds like you should do some manual housekeeping between (or within) your steps using gc() and maybe also rm(). For more details see help(gc) and help(rm).
So instead of exiting R and restarting it you could do:
rm(list = ls())
gc()
But please note: rm(list = ls()) throws away all of your objects. Better to build a list of the objects you really want to discard and pass that list to rm().
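A sketch of both ideas, with hypothetical object names:
keep <- c("raw_data", "graphs")   # objects the next step still needs
rm(list = setdiff(ls(), keep))    # drop everything else
gc()                              # return the freed memory to the OS

# or persist the survivors and restart R between steps:
saveRDS(raw_data, "raw_data.rds")      # end of 03_random_graphs.R
raw_data <- readRDS("raw_data.rds")    # start of 04_permutation_analysis.R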
Question regarding RStudio. Suppose I am running a code in the console:
> code1()
Assume that code1() prints nothing to the console and takes an hour to complete. I want to work on something else while I wait for code1(). Is that possible? Is there a function like runInBackground which I can use as follows:
> runInBackground(code1())
> code2()
The alternatives are running two RStudio instances or writing a batch file that uses Rscript to run code1(), but I wanted to know if there is something easier I can do without leaving the RStudio console. I tried browsing R's help documentation but didn't come up with anything (or maybe I didn't use the proper keywords).
The future package (I'm the author) provides this:
library("future")
plan(multisession)    # resolve futures in parallel background R sessions
f <- future(code1())  # starts evaluating code1() asynchronously
code2()               # runs immediately, without waiting
v <- value(f)         # blocks only if code1() hasn't finished yet
FYI, if you use
plan(cluster, workers = c("n1", "n3", "remote.server.org"))
then the future expression is resolved on one of those machines. Using
plan(future.BatchJobs::batchjobs_slurm)
will cause it to be resolved via a Slurm job scheduler queue.
This question is closely related to Run asynchronous function in R
You can always do this, which is not ideal but works for most purposes (note that shell() is Windows-specific; a non-Windows analogue follows below):
shell(cmd = 'Rscript.exe some_script.R', wait=FALSE)
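On Linux/macOS, a comparable one-liner (a sketch) would be:
system("Rscript some_script.R", wait = FALSE)   # returns to the prompt immediately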
As of version 1.2, RStudio provides this feature. To run a script in the background, select "Start Job" in the "Jobs" panel. You also have the option of copying the background job's result into the working environment.
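If you prefer to stay scripted, the same feature is exposed through the rstudioapi package (a sketch; code1.R is a hypothetical script wrapping the slow call):
# run code1.R as a background job and copy its results back
# into the global environment when it finishes
rstudioapi::jobRunScript("code1.R", importEnv = TRUE, exportEnv = "R_GlobalEnv")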
The mcparallel() function in the parallel package will do the trick, if you are on Linux (or another OS where fork() is available), that is ...
library(parallel)
job1 <- mcparallel(code1())    # forks a child process to run code1()
# ... keep working in the parent session in the meantime ...
jobResult1 <- mccollect(job1)  # collects the result (waits if not done yet)