Suppose I have compiled a code with two functions f1 and f2 using Rcpp::sourceCpp('myPath/myCode.cpp') and I located the sourceCpp_123123.dll that was created.
Now suppose I have two different batch files on Windows 7, both of which run RScript -e "source('myRCode1.r')" and RScript -e "source('myRCode2.r')" respectively. I want my two functions f1 and f2 to be available with each run of RScript.
I can certainly add Rcpp::sourceCpp('myPath/myCode.cpp') at the top of myRCode1.r and myRCode2.r so that it runs before the rest of the code. Another alternative is to convert my two functions f1 and f2 into a package, which is a slightly more involved process.
Is there any easy way to simply load the sourceCpp_123123.dll within myRCode1.r and myRCode2.r ?
I tried dyn.load("myDllPath\sourceCpp_123123.dll") with various permutations and combinations of now=TRUE, local=TRUE, now=FALSE, local=FALSE, but none of the options loaded the two functions.
However, when I tried getLoadedDLLs, I see that sourceCpp_123123.dll has been loaded!
As I commented earlier, this is where you want to use a package.
Or if you are one of those people opposed to packages on whichever grounds (none of which typically convince an experienced R user / programmer, who will almost always advocate the use of a package), then you could cheat and just combine your two files into one.
Which would couple the two function sets. And you may already know that you can mix C++ and R code in one file as we do all the time over at the Rcpp Gallery...
I recently started looking into Makefiles to keep track of the scripts inside my research project. To really understand what is going on, I would like to understand the contents of .Rout files produced by R CMD BATCH a little better.
Christopher Gandrud is using a Makefile for his book Reproducible research with R and RStudio. The sample project (https://github.com/christophergandrud/rep-res-book-v3-examples/tree/master/data) has only three .R files: two of them download and clean data, the third one merges both datasets. They are invoked by the following lines of the Makefile:
# Key variables to define
RDIR = .
# Run the RSOURCE files
$(RDIR)/%.Rout: $(RDIR)/%.R
	R CMD BATCH $<
Neither of the first two files outputs data, nor does the merge script explicitly import data; it just uses the objects created in the first two scripts. So how is the data preserved between the scripts?
To me it seems like the batch execution happens within the same R environment, preserving both objects and loaded packages. Is this really the case? And is it the .Rout file that transfers the objects from one script to the other or is it a property of the batch execution itself?
If the working environment is really preserved between the scripts, I see a lot of potential for issues if there are objects with the same names or functions with the same names from different packages. Another issue of this setup seems to be that the Makefile cannot propagate changes in the first two files downstream because there is no explicit input/prerequisite for the merge script.
I would appreciate learning whether my intuition is right and whether there are better ways to execute R files in a Makefile.
By default, R CMD BATCH saves your workspace to a hidden .RData file after running unless you pass --no-save. That's why it's not really the recommended way to run R scripts. The recommended way is Rscript, which does not save by default; you must write code explicitly to save if that's what you want. This is different from the .Rout file, which should only contain the output from the commands run in the script.
In this case, execution doesn't happen in the exact same environment. R is still called three times, but the workspace is serialized and reloaded between runs.
You are correct that there may be a lot of problems with saving and reloading workspaces by default. That's why most people recommend that you do not do that. But in this case, the author just figured it made things easier for their workflow, so they used it. It would be better to be more explicit about input and output files in general, though.
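A minimal sketch of the more explicit approach, assuming each step writes its result with saveRDS() and the next step reads it with readRDS(). The file names, function names and toy data frames are hypothetical. In a real project each step would be its own script run with Rscript; they are written as functions here only so the block runs on its own.

```r
# Hypothetical stand-ins for the two "download and clean" scripts.
# Each step writes its result to an explicit file instead of relying on
# the workspace that R CMD BATCH saves by default.
out_dir <- tempfile("pipeline_")
dir.create(out_dir)

clean_a <- function(path) {
  a <- data.frame(id = 1:3, x = c(10, 20, 30))
  saveRDS(a, path)
  invisible(path)
}

clean_b <- function(path) {
  b <- data.frame(id = 2:4, y = c("b", "c", "d"))
  saveRDS(b, path)
  invisible(path)
}

# Stand-in for the merge script: its inputs are named files, which a
# Makefile could list as prerequisites of the merged output.
merge_step <- function(path_a, path_b, path_out) {
  merged <- merge(readRDS(path_a), readRDS(path_b), by = "id")
  saveRDS(merged, path_out)
  invisible(path_out)
}

file_a <- file.path(out_dir, "a.rds")
file_b <- file.path(out_dir, "b.rds")
file_m <- file.path(out_dir, "merged.rds")

clean_a(file_a)
clean_b(file_b)
merge_step(file_a, file_b, file_m)

merged <- readRDS(file_m)
print(merged)
```

With this layout, a Makefile rule for the merged file can name the two cleaned files as prerequisites, so a change in either upstream script propagates to the merge step.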
I need to "industrialize" an R code for a data science project, because the project will be rerun several times in the future with fresh data. The new code should be really easy to follow even for people who have not worked on the project before and they should be able to redo the whole workflow quite quickly. Therefore I am looking for tips, suggestions, resources and best-practices on how to achieve this objective.
Thank you for your help in advance!
You can make an R package out of your project, because it has everything you need for a standalone project that you want to share with others:
Easy to share, download and install
R has a very efficient documentation system for your functions and objects when you work within RStudio. Combined with roxygen2, it enables you to document every function precisely, and it makes the code clearer since you can avoid inline comments (but please add them anyway if needed)
You can specify quite easily which dependencies your package needs, so that everyone knows what to install for your project to work. You can also use packrat if you want to mimic Python's virtualenv
R also provides a long-format documentation system called vignettes, which are similar to a printed notebook: you can display code, text, code results, etc. This is where you will write guidelines and methods on how to use the functions, provide detailed instructions for a certain method, etc. Once the package is installed, they are automatically included and available to all users.
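To illustrate the roxygen2 point above, here is a small sketch of a documented function. The function name rescale01 and its behaviour are hypothetical; the tags (@param, @return, @examples, @export) are standard roxygen2 tags. The special comments are ordinary R comments, so the block runs even without roxygen2 installed.

```r
#' Rescale a numeric vector to the range [0, 1]
#'
#' @param x A numeric vector with at least two distinct finite values.
#' @param na.rm Should missing values be ignored when computing the range?
#' @return A numeric vector of the same length as `x`.
#' @examples
#' rescale01(c(2, 4, 6))
#' @export
rescale01 <- function(x, na.rm = TRUE) {
  rng <- range(x, na.rm = na.rm)
  (x - rng[1]) / (rng[2] - rng[1])
}

rescale01(c(2, 4, 6))  # 0.0 0.5 1.0
```

Inside a package, roxygen2 (for example via devtools::document()) turns these comments into the help page shown by ?rescale01.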
The only downside is the following: since R is a functional programming language, a package consists mainly of functions and some other relevant objects (data, for instance), but not really of scripts.
More details about the last point: if your project consists of a script that calls a set of functions to do something, the script cannot appear directly within the package. There are two options here: a) you write a dispatcher function that runs a set of functions to do the job, so that users just have to call one function to run the whole method (not really good for maintenance); b) you include the whole script in a vignette (see above). With this method, people just have to write a single R file (which can be copy-pasted from the vignette), which may look like this:
library(mydatascienceproject)
library(...)
...
dothis()
dothat()
finishwork()
That enables you to run the whole job from a terminal or a remote machine with Rscript, as follows (using argparse to add arguments):
Rscript myautomatedtask.R --arg1 anargument --arg2 anotherargument
And finally, if you write a bash file that calls Rscript, you can automate everything!
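As a rough sketch of how the script could read those arguments, the block below uses base R's commandArgs() rather than argparse, so it has no extra dependencies. The helper name parse_cli_args is hypothetical, and it only handles plain "--name value" pairs, without the validation or help text that argparse would provide.

```r
# Parse "--name value" pairs into a named list.
# A base-R stand-in for argparse: no type checking and no help output.
parse_cli_args <- function(args = commandArgs(trailingOnly = TRUE)) {
  if (length(args) == 0) {
    return(list())
  }
  if (length(args) %% 2 != 0) {
    stop("expected arguments as '--name value' pairs")
  }
  keys <- args[seq(1, length(args), by = 2)]
  values <- args[seq(2, length(args), by = 2)]
  if (!all(startsWith(keys, "--"))) {
    stop("every argument name must start with '--'")
  }
  stats::setNames(as.list(values), sub("^--", "", keys))
}

# Simulates: Rscript myautomatedtask.R --arg1 anargument --arg2 anotherargument
opts <- parse_cli_args(c("--arg1", "anargument", "--arg2", "anotherargument"))
print(opts$arg1)
print(opts$arg2)
```

In the actual script, calling parse_cli_args() with no argument picks up whatever was passed after the script name on the Rscript command line.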
Feel free to read Hadley Wickham's book about R packages, it is super clear, full of best practices and of great help in writing your packages.
One can get lost in the multiple files in the project's folder, so it should be structured properly: link
Naming conventions that I use: first, second.
Set the random seed so that the outputs are reproducible.
Documentation is important: you can insert a Roxygen skeleton in RStudio (default shortcut Ctrl+Alt+Shift+R).
I usually separate the code into smaller, logically cohesive scripts and use a main.R script that calls the others.
If you use a special set of libraries, you can consider using packrat. Once you set it up, you can manage the installed project-specific libraries.
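The random-seed tip above can be checked with a few lines. This is only a sketch; the seed value 2024 is arbitrary.

```r
# Fix the seed before any random step so that reruns give identical results.
set.seed(2024)
first_run <- sample(1:100, 5)

# Resetting the same seed reproduces the same draw.
set.seed(2024)
second_run <- sample(1:100, 5)

identical(first_run, second_run)
```

Setting the seed once near the top of main.R is usually enough, as long as the scripts it calls run in a fixed order.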
I have been working for a few months on a project, and there are now about 120 functions. About one third of these functions are no longer necessary to run the final version of the script. The functions are all loaded from a function folder into a separate namespace.
Is there a way in R to see which functions are called?
The alternative is to go to the script and write down the functions that are now actually used.
I have lots of .r scripts that I want to source all. I have written a function like the one below to source.
sourcer <- function() {
  source("wil.r")
  source("k.r")
  source("l.r")
}
Can anyone please tell me how to activate this code and how to call each script whenever I want to use it?
In addition to the answer by @user2885462: if the amount of R code you need to source grows, you might want to wrap the code into an R package. This provides a convenient way of loading the code and allows you to add tests, documentation, etc. Reading the official package writing tutorial is a good place to start.
For an individual project, I like to have all (or most) of my R functions in separate .r files, all in the same folder: e.g., AllFunctions
Then at the beginning of my main code I run the following line of code, which sources all .r (and other extensions if they exist - which they usually don't) in the AllFunctions folder:
for (nm in list.files("AllFunctions", pattern = "\\.[RrSsQq]$")) source(file.path("AllFunctions", nm))
My aim is to better organize the work done by my R code.
In particular, it could be useful to split the R code I have written into different R files, perhaps with each R file accomplishing a different task. I have in mind what we can do in Matlab with different M files, where we can easily call functions written in different M files directly from the main code.
Is it useful to write these R files in the form of functions?
How can we call these R files / functions in the main code?
Thanks
You can use source("filename.R") to include the file in your main script.
I am not sure if there is a ready-made function to include an entire directory, but it is straightforward to write one using list.files() and then calling source() dynamically for each file name. You can also filter the files to list only *.R, for example.
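A minimal sketch of such a helper, assuming the files only define functions and can be sourced in alphabetical order. The name source_dir and the demo file names are hypothetical.

```r
# Source every .R/.r file in a directory, in alphabetical order,
# into the given environment (the global environment by default).
source_dir <- function(dir, envir = globalenv()) {
  files <- sort(list.files(dir, pattern = "\\.[Rr]$", full.names = TRUE))
  for (f in files) {
    source(f, local = envir)
  }
  invisible(files)
}

# Demo with a throwaway directory holding two function files and one
# non-R file that the pattern should skip.
demo_dir <- tempfile("functions_")
dir.create(demo_dir)
writeLines("add_one <- function(x) x + 1", file.path(demo_dir, "add_one.R"))
writeLines("double_it <- function(x) x * 2", file.path(demo_dir, "double_it.r"))
writeLines("not R code", file.path(demo_dir, "notes.txt"))

fun_env <- new.env()
loaded <- source_dir(demo_dir, envir = fun_env)
fun_env$double_it(fun_env$add_one(2))  # 6
```

Passing a separate environment keeps the sourced functions out of the global workspace; calling source_dir("some_folder") without envir behaves like a series of plain source() calls.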
Unless you intend to write an R package, you should rethink your organization. R is not Matlab, thank goodness! You can place as many functions as you wish into a single file and make them all available in your environment with source("foo.r"). If you are writing a collection of generic functions and don't want to build a package, this really is the cleaner way to go.
As a side thought, consider making some of your tools more flexible by adding more input arguments. You may find that you don't really need so many separate functions/files. As a trivial example, if you have some function do_it_double , another do_it_integer , and yet another do_it_character , all of which do basically the same thing, just merge them into a single do_it_all(x,y,datatype='double') and override the default datatype as desired. (I know this can be done with internal input validation. I'm just giving an example)
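A small sketch of that merge, keeping the do_it_all(x, y, datatype = 'double') signature from the text. What the function actually does with its inputs is not specified above, so the body here (convert both inputs, then combine them) is only a placeholder.

```r
# One function with a `datatype` argument instead of three near-identical ones.
# The "work" (convert both inputs, then combine them) is a placeholder.
do_it_all <- function(x, y, datatype = c("double", "integer", "character")) {
  datatype <- match.arg(datatype)  # defaults to "double", rejects other values
  convert <- switch(datatype,
    double = as.double,
    integer = as.integer,
    character = as.character
  )
  x <- convert(x)
  y <- convert(y)
  if (datatype == "character") paste0(x, y) else x + y
}

do_it_all(1.5, 2)                          # 3.5
do_it_all(1.5, 2, datatype = "integer")    # 3L, since as.integer(1.5) is 1L
do_it_all(1.5, 2, datatype = "character")  # "1.52"
```

Using match.arg() means a misspelled datatype fails immediately with an error instead of silently falling through.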
Your approach might work well. I would recommend wrapping the code in functions and using one R file per function.
It might also be interesting to look at the packages devtools and ProjectTemplate, which aim to help with organizing R code.