I have several custom functions that I use frequently in R. Rather than souce this file (or parts thereof) in each script, is there some way to add this to a base R file such that they are always available when I use R?
Yes, create a package. There are numerous tutorials as well as the Writing R Extensions manual that came with your copy of R.
It may seem like too much work at first, but you will probably be glad that you did this in the longer run.
PS And you can then load that package from ~/.Rprofile. For really short code, you can also define it there.
A package may be overkill for a for a few useful functions. I'd argue there's nothing wrong with explicitly source()ing them as you need them - at least it is explicit so that if you email someone your code, you won't forget to include those other scripts.
Another option is to use the .Rprofile file. You can read about the details in ?Startup. Basically, the idea is that:
...a file called ‘.Rprofile’ is searched for in the current directory or
in the user's home directory (in that order). The user profile file is
sourced into the workspace.
You can read here about how many people use this functionality.
The accepted answer is best long-term: Make a package.
Luckily, the learning curve for doing this has been dramatically reduced by the devtools package: It automates package creation (a nice assist in getting off on the right foot), encourages good practices (like documenting with roxygen2, and helps with using online version control (bitbucket, github or other), sharing your package with others. It's also very helpful for smoothing your way to CRAN submission.
Good docs at http://adv-r.had.co.nz and http://r-pkgs.had.co.nz .
to create your package, for instance you can:
install.packages("devtools")
devtools::create("path/to/package/pkgname")
You could also look at the 'mvbutils' package: it lets you set up a hierarchical set of "tasks" (folders with workspace ".RData" files in them) such that you can always see what's in the ancestral tasks (ie the ancestors are in the search() path). So you can put your custom functions in the "starting task" where you always start R; and then you change to vwhatever project-specific task you require, so you can avoid cluttered workspaces, but you'll still be able to use (and edit) your custom functions because the starting task is always ancestral. Objects (including functions) get stored in ".RData" files and are thus loaded/saved automatically, but there are separate text-backup facilities for functions.
There are lots of different ways of working in R, and no "one-size-fits-all" best solution. It's also not easy to find an overview! Speaking just for myself:
I'm not a fan of having to 'source' everything in every time; for one thing, it simply doesn't work with big data sets and/or results of model runs.
I think packages are hard to create and maintain; there is a really significant overhead. After the first 5 packages you write, it does get a bit easier provided you do it on at least a weekly basis so you don't forget how, but really...
In fact, 'mvbutils' also has a bunch of tools for facilitating the creation and (especially) maintenance of packages, designed to interface smoothly with the task-hierarchy system. I use & edit my own packages all the time (including editing mvbutils itself); but if it wasn't for the tools in 'mvbutils', I'd be grinding my teeth in frustration most days of the week.
Related
Tl;dr
I have a complete NAMESPACE file for the package I am building. I want to execute all the importFrom(x,y) clauses in that file to get the set of necessary functions I use in the package. Is there an automated way to do that?
Full question
I'm currently working on building a package that in turns depends on a bunch of other packages. Think: both Hmisc and dplyr.
The other contributors to the package don't really like using devtools::build() every 5 minutes when they debug, which I can understand. They want to be able to toggle a dev_mode boolean which, when set to true, loads all dependencies and sources all scripts in the R/ and data-raw/ folders, so that all functions and data objects are loaded in memory. I don't necessarily think that's a perfectly clean solution, but they're technically my clients and it's a big help to them so please don't offer a frame challenge unless it's the most user-friendly solution imaginable.
The problem with this solution is that when I load two libraries whose namespace clash, functions in the package that would perfectly work otherwise start throwing errors. I thus need to improve my import system.
Thanks to thorough documentation (and the help of devtools::check()), my NAMESPACE is complete. I guess I could split it in pieces and eval() some well-chosen parts of it, but it sort of feels like I'm probably not the first person to encounter this difficulty. Is there an automated way to parse NAMESPACE and import the proper functions from the proper packages ?
Thanks !
I would like to use R objects (e.g., cleaned data) generated in one git-versioned R project in another git-versioned R project.
Specifically, I have multiple git-versioned R projects (that hold drake plans) that do various things for my thesis experiments (e.g., generate materials, import and clean data, generate reports/articles).
The experiment-specific projects should ideally be:
Connectable - so that I can get objects (mainly data and materials) that I generated in these projects into another git-versioned R project that generates my thesis report.
Self-contained - so that I can use them in other non-thesis projects (such as presentations, reports, and journal manuscripts). When sharing such projects, I'd ideally like not to need to share a monolithic thesis project.
Versioned - so that their use in different projects can be independent (e.g., if I make changes to the data cleaning for a manuscript after submitting the thesis, I still want the thesis to be reproducible as it was originally compiled).
At the moment I can see three ways of doing this:
Re-create the data cleaning process
But: this involves copy/paste, which I'd like to avoid, especially if things change upstream.
Access the relevant scripts/functions by changing the working directory
But: even if I used here it seems that this would introduce poor reproducibility.
Make the source projects into packages and make the objects I want to "export" into exported data (as per the data section of Hadley's R packages guide)
But: I'd like to avoid the unnecessary metadata, artefacts, and noise (e.g., see Miles McBain's "Project as an R package: An okay idea") if I can.
Is there any other way of doing this?
Edit: I tried #landau's suggestion of using a single drake plan, which worked well for a while, until (similar to #vrognas' case) I ended up with too many sub-projects (e.g., conference presentations and manuscripts) that relied on the same objects. Therefore, I added some clarifications above to my intentions with the question.
My first recommendation is to use a single drake plan to unite the stages of the overall project that need to share data. drake is designed to handle a lot of moving parts this way, and it will be more seamless when it comes to drake's decisions about what to rerun downstream. But if you really do need different plans in different sub-projects that share data, you can track each shared dataset as a file_out() file in one plan and track it with file_in() in another plan.
upstream_plan <- drake_plan(
export_file = write_csv(dataset, file_out("exported_data/dataset.csv"))
)
downstream_plan <- drake_plan(
dataset = read_csv(file_in("../upstream_project/exported_data/dataset.csv"))
)
You fundamentally misunderstood Miles McBain’s critique. He isn’t saying that you shouldn’t write reusable code nor that you shouldn’t use packages. He’s saying that you shouldn’t use packages for everything. But reusable code (i.e. code that you want to reuse) absolutely belongs in packages (or, better, modules), which can then be used in multiple projects.
That being said, first off, pay attention to Will Landau’s advice.
Secondly, you can make your RStudio projects configurable such that they can load data based on paths given in a configuration. Once that’s accomplished, nothing speaks against hard-coding paths to data in different projects inside that config file.
I am in a similar situation. I have many projects that are spawned from one raw dataset. Previously, when the project was young and small, I had it all in one version controlled project. This got out of hand as more sub-projects were spawned and my git history got cluttered from working on projects in parallel. This could be to my lack of skills with git. My folder structure looked something like this:
project/.git
project/main/
project/sub-project_1/
project/sub-project_2/
project/sub-project_n/
I contemplated having each project in its own git branch, but then I could not access them simultaneously. If I had to change something to the main dataset (eg I might have not cleaned some parts) then project 1 could become outdated and nonfunctional. Once I had finished project 1, I would have liked it to be isolated and contained for reproducibility. This is easier to achieve if the projects are separated. I don't think a drake/targets plan would solve this?
I also looked briefly into having the projects as git submodules but it seemed to add too much complexity. Again, my git ignorance might shine through here.
My current solution is to have the main data as an R-package, and each sub-project as a separate git-versioned folder (they are actually packages as well, but this is not necessary). This way I can load in a specific version of the data (using renv for package versions).
My folder structure now looks something like this:
main/.git
sub-project_1/.git
sub-project_2/.git
sub-project_n/.git
And inside each sub-project, I call library(main) to load the cleaned data. Within each sub-project, a drake/targets plan could be used.
Is it good practice to include every library I need to execute a function within that function?
For example, my file global.r contains several functions I need for a shiny app. Currently I have all needed packages at the top of the file. When I'm switching projects/copying these functions I have to load the packages/include them in the new code. Otherwise all needed packages are contained in that function. Of course I have to check all functions with a new R session, but I think this could help in the long run.
When I tried to load a package twice it won't load the package again but checks it's already loaded. My main question is if it would slow my functions if I restructure in that way?
I only saw that practice once, library calls inside functions, so I'm not sure.
As one of the commenters suggest, you should avoid loading packages within a function since
The function now has a global effect - as a general rule, this is something to avoid.
There is a very small performance hit.
The first point is the big one. As with most optimisation, only worry about the second point if it's an issue.
Now that we've established the principle, what are the possible solution.
In small projects, I have a file called packages.R that includes all the library calls I need. This is sourced at the top of my analysis script. BTW, all my functions are in a file call func.R. This workflow was stolen/adapted from a previous SO question
If you're only importing a single function, you could use the :: trick, e.g. package::funcA(...) That way you avoid loading the package.
For larger projects, I create an R package that handles all necessary imports. The benefit of creating a package is detailed in this answer on structuring large R projects.
I have written personal functions in R that are not specific to one (or a few) projects.
What are the best practices (in R) to put those kind of functions?
Is the best way to do it to have one file that gets sourced at startup? or is there a better (recommended) way to deal with this situation?
Create a package named "utilities" , put utility functions in that package, try to aim for one function per file, and store the package in a source control system (e.g., GIT, SVN ). It will save you time in the long run.
P.S. .Rprofile tends to get accidentally deleted.
If you have many, it would be good to make it into a package that you load each time you start working.
It is probably not a good idea to have a monolithic script with a bunch of functions. Instead break the file up into several files each of which either has only one function (my preference) or has a group of functions that are logically similar. That makes it easier to find things when you need to make changes.
Most people use the .Rprofile file for this. Here are two links which talk about this file in some detail.
http://www.statmethods.net/interface/customizing.html
http://blog.revolutionanalytics.com/2013/10/sample-rprofile.html
At the top of my .Rprofile file I call library() for the various libraries which I normally use. I also have some personal handy functions which I've come to rely on. Because this file is sourced on startup, they are available to me every session.
From my experience, a package will be the best choice for personal functions. Firstly I put all new functions into a personal package, which I called it My. When I find some functions was similar and are worth to become an independent package, I will create a new package and move them.
While I am writing .R functions I constantly need to manually write source("funcname.r") to get the changes reflected in the workspace. I am sure it must be possible to do this automatically. So what I would like would be just to make changes in my function, save the function and be able to use the new function in R workspace without manually "sourcing" this function. How can I do that?
UPDATE: I know about selecting appropriate lines of code and pressing CTRL+R in R Editor (RGui) or using Notepad++ and executing the lines into R. But this approach has a disadvantage of making my workspace console "muddled". I would like to stop this practice if at all possible.
You can use R studio which has a source on save option.
If you are prepared to package your functions into a package, you may enjoy exploring Hadley's devtools package. This provides a suite of tools to write, test and document
packages.
https://github.com/hadley/devtools
This approach offer many advantages, but mainly reloading the package with a minimum of retyping.
You will still have to type load_all("yourpackage") but I find this small amount of typing is small beer compared to the advantages of devtools.
For additional information, including how to setup devtools, have a look at https://github.com/hadley/devtools/wiki/development
If you're using Eclipse + StatET, you can press CTRL+R+S, which saves your script and sources it. As close to automatic as I can get.
If you can get your text editor to run a system command after it saves the file, then you could use something like AutoIt (on Windows) or a batch script (on UNIX-derivative) to pass a call to source off to all running copies of R. But that's a heck of a lot of work for not much gain.
Still, I think it's much more likely to work being event-driven on the text editor end vs. having R constantly scan for updates (or somehow interface with the OS's update-event-messaging-system).
This is likely not possible (automatically detecting disc changes without intervention or running at least one line).
R needs to read into memory functions, so a change on the disc wouldn't be reflected in the workspace without reloading your functions.
If you are into developing R functions, some amount of messiness during your development process will be likely inevitable, but perhaps I could suggest that you try writing an R-package to house your functions?
This has the advantage of being able to robustly document your functions, using lazy loading so that you have access to your functions/datasets immediately without sourcing them.
Don't be afraid of making a package, it's easy with package.skeleton() and doesn't have to go on CRAN but could be for your own personal use without distribution! Just have fun!
Try to accept some messiness during development knowing you are working towards your goal and fighting the good fight of code organization and documentation!
We are only imperfect people, in an imperfect world, but we mean well!