Sourcing files by relative path is useful when dealing with large codebases. Other programming languages have well-defined mechanisms for sourcing a file via a path relative to the directory of the file doing the sourcing; Ruby's require_relative is one example. What is a good way to implement relative path sourcing in R?
Below is what I pieced together a while back from various recipes and R forum posts. It's worked well for me in straight development but is not robust. For example, it breaks when the files are loaded via the testthat package, specifically auto_test(): rscript_stack() returns character(0).
# Returns the stack of R script files currently being sourced
rscript_stack <- function() {
  Filter(Negate(is.null), lapply(sys.frames(), function(x) x$ofile))
}
# Returns the current R script file path
rscript_current <- function() {
  stack <- rscript_stack()
  r <- as.character(stack[length(stack)])
  first_char <- substring(r, 1, 1)
  if (first_char != '~' && first_char != .Platform$file.sep) {
    r <- file.path(getwd(), r)
  }
  r
}
# Sources a file relative to the current script
source_relative <- function(relative_path, ...) {
  source(file.path(dirname(rscript_current()), relative_path), ...)
}
Do you know of a better source_relative implementation?
After a discussion with @hadley on GitHub, I realized that my question goes against the common development patterns in R.
It seems that in R, sourced files often assume that the working directory (getwd()) is set to the directory they live in. To support this, source has a chdir argument whose default value is FALSE. When set to TRUE, it changes the working directory to the directory of the file being sourced for the duration of the call.
In summary:
Assume that source is always relative, because the working directory of the file being sourced is set to the directory that the file is in.
To make this work, always set chdir = TRUE when you source files from another directory, e.g., source('lib/stats/big_stats.R', chdir = TRUE), as sketched below.
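A minimal sketch (file names are hypothetical): big_stats.R can source a sibling file with a plain relative path, because chdir = TRUE makes lib/stats the working directory while it runs.

# In lib/stats/big_stats.R -- resolves against lib/stats during the source() call:
source('helpers.R')

# From the project root:
source('lib/stats/big_stats.R', chdir = TRUE)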
For convenient sourcing of entire directories in a predictable way I wrote sourceDir, which sources files in a directory in alphabetical order.
sourceDir <- function(path, pattern = "\\.[rR]$", chdir = TRUE) {
  files <- sort(dir(path, pattern, full.names = TRUE))
  invisible(lapply(files, source, chdir = chdir))
}
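For example (directory name hypothetical):

sourceDir('lib/stats')  # sources lib/stats/*.R alphabetically, each with chdir = TRUE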
Related
I have a table in R containing many files that I need copied to a destination folder. The files are spread out over dozens of folders, each several sub-folders down. I have successfully used the following code to find all of the files and their locations:
(fastq_files <- list.files(Illumina_output, ".fastq.gz", recursive = TRUE, include.dirs = TRUE) %>% as_tibble)
After appending the full path, I have a tibble that looks something like this:
full_path
Q:/IlluminaOutput/2019/091119 AB NGS/Data/Intensities/BaseCalls/19-15897-HLA-091119-AB-NGS_S14_L001_R1_001.fastq.gz
Q:/IlluminaOutput/2019/091119 AB NGS/Data/Intensities/BaseCalls/19-15236-HLA-091119-AB-NGS_S14_L001_R2_001.fastq.gz
Q:/IlluminaOutput/2018/062818AB NGS/Data/Intensities/BaseCalls/18-06875-HLA-062818-NGS_S11_L001_R1_001.fastq.gz
Using the file.copy function gives an error that the file name is too long, a known issue in Windows (I am using RStudio on Windows 10).
I found that if I set the working directory to the file location, I am able to copy files. Starting with a table like this:
file                                                 path
19-14889-HLA-091119-AB-NGS_S14_L001_R1_001.fastq.gz  Q:/IlluminaOutput/2019/091119 AB NGS/Data/Intensities/BaseCalls/
19-14889-HLA-091119-AB-NGS_S14_L001_R2_001.fastq.gz  Q:/IlluminaOutput/2019/091119 AB NGS/Data/Intensities/BaseCalls/
18-09772-HLA-062818-NGS_S11_L001_R1_001.fastq.gz     Q:/IlluminaOutput/2018/062818AB NGS/Data/Intensities/BaseCalls/
18-09772-HLA-062818-NGS_S11_L001_R2_001.fastq.gz     Q:/IlluminaOutput/2018/062818AB NGS/Data/Intensities/BaseCalls/
I used the following code to successfully copy the first file:
(dir <- as.character(as.vector(file_and_path[1,2])))
setwd(dir)
(file <- as.character(as.vector(file_and_path[1,1])))
(file.copy(file, Trusight_output) %>% as.tibble)
I got this to work, but I don't know how to apply these steps to every row in my table. I think I probably have to use the lapply function, but I'm not sure how to construct it.
This should do the trick, assuming that file_and_path$file and file_and_path$path are both character vectors and that Trusight_output is an absolute path:
f <- function(file, from, to) {
  cwd <- setwd(from)   # setwd() returns the previous working directory
  on.exit(setwd(cwd))  # restore it when f() exits, even on error
  file.copy(file, to)
}
Map(f, file = file_and_path$file, from = file_and_path$path, to = Trusight_output)
We use Map here rather than lapply because we are applying a function of more than one argument. FWIW, operations like this are often better suited for PowerShell.
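If you'd rather have a plain logical vector of per-file results than the list Map returns, you can wrap the call like this (a small convenience, not part of the original answer; the length-1 `to` argument is recycled):

copied <- unlist(Map(f, file = file_and_path$file, from = file_and_path$path, to = Trusight_output))
all(copied)  # TRUE if every file was copied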
I would like to set the working directory to the path of current script programmatically but first I need to get the path of current script.
So I would like to be able to do:
current_path = ...retrieve the path of current script ...
setwd(current_path)
Just like the RStudio menu item Session > Set Working Directory > To Source File Location does.
So far I tried:
initial.options <- commandArgs(trailingOnly = FALSE)
file.arg.name <- "--file="
script.name <- sub(file.arg.name, "", initial.options[grep(file.arg.name, initial.options)])
script.basename <- dirname(script.name)
script.name returns NULL
source("script.R", chdir = TRUE)
Returns:
Error in file(filename, "r", encoding = encoding) : cannot open the connection
In addition: Warning message:
In file(filename, "r", encoding = encoding) :
  cannot open file '/script.R': No such file or directory
dirname(parent.frame(2)$ofile)
Returns: Error in dirname(parent.frame(2)$ofile) : a character vector argument expected
...because parent.frame(2)$ofile is NULL
frame_files <- lapply(sys.frames(), function(x) x$ofile)
frame_files <- Filter(Negate(is.null), frame_files)
PATH <- dirname(frame_files[[length(frame_files)]])
Returns: NULL, because frame_files is a list of length 0
thisFile <- function() {
  cmdArgs <- commandArgs(trailingOnly = FALSE)
  needle <- "--file="
  match <- grep(needle, cmdArgs)
  if (length(match) > 0) {
    # Rscript
    return(normalizePath(sub(needle, "", cmdArgs[match])))
  } else {
    # 'source'd via R console
    return(normalizePath(sys.frames()[[1]]$ofile))
  }
}
Returns: Error in path.expand(path) : invalid 'path' argument
I have also gone through the answers to several related questions. No joy.
Working with RStudio 1.1.383
EDIT: It would be great if there was no need for an external library to achieve this.
In RStudio, you can get the path to the file currently shown in the source pane using
rstudioapi::getSourceEditorContext()$path
If you only want the directory, use
dirname(rstudioapi::getSourceEditorContext()$path)
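Combining this with the question's goal, a one-liner for interactive RStudio sessions sets the working directory to the current file's folder:

setwd(dirname(rstudioapi::getSourceEditorContext()$path))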
If you want the name of the file that's been run by source(filename), that's a little harder. You need to look for the variable srcfile somewhere back in the stack. How far back depends on how you write things, but it's around 4 steps back: for example,
fi <- tempfile()
writeLines("f()", fi)
f <- function() print(sys.frame(-4)$srcfile)
source(fi)
fi
should print the same thing on the last two lines.
Update March 2019
Based on Alexis Lucattini's and user2554330's answers, this works both on the command line and in RStudio. It also resolves the as_tibble deprecation warning:
library(tidyverse)
getCurrentFileLocation <- function() {
  this_file <- commandArgs() %>%
    tibble::enframe(name = NULL) %>%
    tidyr::separate(col = value, into = c("key", "value"), sep = "=", fill = "right") %>%
    dplyr::filter(key == "--file") %>%
    dplyr::pull(value)
  if (length(this_file) == 0) {
    this_file <- rstudioapi::getSourceEditorContext()$path
  }
  return(dirname(this_file))
}
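To mirror the original goal of setting the working directory to the script's folder:

setwd(getCurrentFileLocation())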
TLDR: The here package (available on CRAN) helps you build a path from a project's root directory. R projects configured with here() can be shared with colleagues working on different laptops or servers and paths built relative to the project's root directory will still work. The development version is at github.com/r-lib/here.
With git
You certainly store your R code in a directory. This directory is probably part of a git repository and/or an RStudio project. I would recommend building all paths relative to that project's root directory. For example, let's say that you have an R script that creates reusable plotting functions and an R Markdown notebook that loads that script and plots graphs in a nice (so nice) document. The project tree would look something like this:
├── notebooks
│ ├── analysis.Rmd
├── R
│ ├── prepare_data.R
│ ├── prepare_figures.R
From the analysis.Rmd notebook, you would import the plotting functions with here() as such:
source(file.path(here::here("R"), "prepare_figures.R"))
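Since here() also accepts path components directly, the same call can be written in one step:

source(here::here("R", "prepare_figures.R"))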
Why?
Hadley Wickham in a Stack Overflow comment:
"You should never use setwd() in R code - it basically defeats the idea of using a working directory because you can no longer easily move your code between computers." – hadley Nov 20 '10 at 23:44
From the Ode to the here package:
Do you:
Have setwd() in your scripts? PLEASE STOP DOING THAT.
This makes your script very fragile, hard-wired to exactly one time and place. As soon as you rename or move directories, it breaks. Or maybe you get a new computer? Or maybe someone else needs to run your code?
[...]
Classic problem presentation: Awkwardness around building paths and/or setting working directory in projects with subdirectories. Especially if you use R Markdown and knitr, which trips up a lot of people with its default behavior of "working directory = directory where this file lives". [...]
Install the here package:
install.packages("here")
library(here)
here()
here("construct","a","path")
Documentation of the here() function:
Starting with the current working directory during package load time, here will walk the directory hierarchy upwards until it finds a directory that satisfies at least one of the following conditions:
- contains a file matching [.]Rproj$ with contents matching ^Version: in the first line
- [... other options ...]
- contains a directory .git
Once established, the root directory doesn't change during the active R session. here() then appends the arguments to the root directory.
The development version of the here package is available on github.
What about files outside the project directory?
If you are loading or sourcing files outside the project directory, the recommended way is to use an environment variable at the Operating System level. Other users of your R code on different laptops or servers would need to set the same environment variable. The advantage is that it is portable.
data_path <- Sys.getenv("PROJECT_DATA")
df <- read.csv(file.path(data_path, "file_name.csv"))
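The variable itself is set once per machine, for example in ~/.Renviron (the value below is hypothetical):

# contents of ~/.Renviron
PROJECT_DATA=/home/user/project_data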
Note: There is a long list of environmental variables which can affect an R session.
What about many projects sourcing each other?
It's time to create an R package.
If you're running an Rscript through the command line, e.g.:
Rscript /path/to/script.R
The function below will assign the full path /path/to/script.R to this_file:
library(tidyverse)
get_this_file <- function() {
  commandArgs() %>%
    tibble::enframe(name = NULL) %>%
    tidyr::separate(
      col = value, into = c("key", "value"), sep = "=", fill = "right"
    ) %>%
    dplyr::filter(key == "--file") %>%
    dplyr::pull(value)
}
this_file <- get_this_file()
print(this_file)
Here is a custom function to obtain the path of a file in R, RStudio, or from an Rscript:
stub <- function() {}
thisPath <- function() {
  cmdArgs <- commandArgs(trailingOnly = FALSE)
  if (length(grep("^-f$", cmdArgs)) > 0) {
    # R console option: R -f script.R
    normalizePath(dirname(cmdArgs[grep("^-f$", cmdArgs) + 1]))[1]
  } else if (length(grep("^--file=", cmdArgs)) > 0) {
    # Rscript/R console option: Rscript script.R or R --file=script.R
    normalizePath(dirname(sub("^--file=", "", cmdArgs[grep("^--file=", cmdArgs)])))[1]
  } else if (Sys.getenv("RSTUDIO") == "1") {
    # RStudio
    dirname(rstudioapi::getSourceEditorContext()$path)
  } else if (!is.null(attr(stub, "srcref"))) {
    # 'source'd via R console
    dirname(normalizePath(attr(attr(stub, "srcref"), "srcfile")$filename))
  } else {
    stop("Cannot find file path")
  }
}
https://gist.github.com/jasonsychau/ff6bc78a33bf3fd1c6bd4fa78bbf42e7
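For example, to mirror the question's goal of setting the working directory to the script's folder:

setwd(thisPath())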
Another option to get the current script path is funr::get_script_path(), and you don't need to run your script using RStudio.
I had trouble with all of these because they rely on libraries that I couldn't use (because of packrat) until after setting the working directory (which was why I needed to get the path to begin with).
So, here's an approach that just uses base R (EDITED to handle Windows \ characters in addition to / in paths):
args <- commandArgs()
scriptName <- args[substr(args, 1, 7) == '--file=']
if (length(scriptName) == 0) {
  scriptName <- rstudioapi::getSourceEditorContext()$path
} else {
  scriptName <- substr(scriptName, 8, nchar(scriptName))
}
pathName <- substr(
  scriptName,
  1,
  nchar(scriptName) - nchar(strsplit(scriptName, '.*[/|\\]')[[1]][2])
)
If you don't want to use (or have to remember) code, simply hover over the script's tab in RStudio and the full path will appear in a tooltip.
The following solves the problem for three cases: the RStudio Source button, the RStudio console (source(...), if the file is still open in the source pane), or the OS console via Rscript:
this_file <- gsub("--file=", "", commandArgs()[grepl("--file", commandArgs())])
if (length(this_file) > 0) {
  wd <- paste(head(strsplit(this_file, '[/|\\]')[[1]], -1), collapse = .Platform$file.sep)
} else {
  wd <- dirname(rstudioapi::getSourceEditorContext()$path)
}
print(wd)
The following code gives the directory of the running script when run from RStudio; when run from the command line via Rscript, it falls back to the current working directory (which matches the script's directory only if you launch it from there):
if (requireNamespace("rstudioapi", quietly = TRUE) && rstudioapi::isAvailable()) {
  # RStudio: use the path of the active document
  wdir <- dirname(rstudioapi::getActiveDocumentContext()$path)
} else {
  # Rscript: the directory the command was launched from
  wdir <- getwd()
}
setwd(wdir)
I have a 370MB zip file and the content is a 4.2GB csv file.
I did:
unzip("year2015.zip", exdir = "csv_folder")
And I got this message:
1: In unzip("year2015.zip", exdir = "csv_folder") :
  possible truncation of >= 4GB file
Have you experienced that before? How did you solve it?
I agree with @Sixiang.Hu's answer: R's unzip() won't work reliably with files greater than 4GB.
As for how did you solve it?: I've tried a few different tricks, and in my experience the result of anything using R's built-ins is (almost) invariably an incorrect identification of the end-of-file (EOF) marker before the actual end of the file.
I deal with this issue in a set of files I process on a nightly basis, and to handle it consistently and in an automated fashion, I wrote the function below to wrap the UNIX unzip. This is basically what you'd get from system("unzip ..."), but it gives you a bit more flexibility in its behavior and allows you to check for errors more systematically.
decompress_file <- function(directory, file, .file_cache = FALSE) {
  if (.file_cache == TRUE) {
    print("decompression skipped")
  } else {
    # Set working directory for decompression;
    # simplifies unzip directory location behavior
    wd <- getwd()
    setwd(directory)
    # Run decompression
    decompression <- system2("unzip",
                             args = c("-o",  # include override flag
                                      file),
                             stdout = TRUE)
    # Uncomment to delete archive once decompressed
    # file.remove(file)
    # Reset working directory
    setwd(wd); rm(wd)
    # Test for success criteria; change the search
    # depending on your implementation
    if (grepl("Warning message", tail(decompression, 1))) {
      print(decompression)
    }
  }
}
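A hypothetical call matching the question's archive, assuming it sits in the current working directory:

decompress_file(directory = getwd(), file = "year2015.zip")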
Notes:
The function does a few things, which I like and recommend:
- uses system2 over system because the documentation says "system2 is a more portable and flexible interface than system"
- separates the directory and file arguments, and moves the working directory to the directory argument; depending on your system, unzip (or your choice of decompression tool) gets really finicky about decompressing archives outside the working directory
  - it's not pure, but resetting the working directory is a nice step toward the function having fewer side effects
  - you can technically do it without this, but in my experience it's easier to make the function more verbose than to deal with generating file paths and remembering unzip CLI flags
- I set it to use the -o flag to automatically overwrite when rerun, but you could supply any number of arguments
- includes a .file_cache argument which allows you to skip decompression
  - this comes in handy if you're testing a process which runs on the decompressed file, since 4GB+ files tend to take some time to decompress
- commented out in this instance, but if you know you don't need the archive after decompressing, you can remove it inline
- the system2 command redirects the stdout to decompression, a character vector
  - an if + grepl check at the end looks for warnings in the stdout, and prints the stdout if it finds that expression
Checking ?unzip, I found the following comment in its Note section:
It does have some support for bzip2 compression and > 2GB zip files (but not >= 4GB files pre-compression contained in a zip file: like many builds of unzip it may truncate these, in R's case with a warning if possible).
You can try to unzip it outside of R (using 7-Zip for example).
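If 7-Zip is installed and on your PATH, a sketch of driving it from R (archive and folder names taken from the question; 7z x extracts, and -o sets the output directory):

system2("7z", args = c("x", "year2015.zip", "-ocsv_folder"))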
To add to the list of possible solutions: in case you have Java (JDK) available on your machine, you can wrap jar xf in an R function similar in interface to utils::unzip(). A very simple example:
unzipLarge <- function(zipfile, exdir = getwd()) {
  zipfile <- normalizePath(zipfile)  # resolve the archive path before changing directory
  oldWd <- getwd()
  on.exit(setwd(oldWd))
  setwd(exdir)
  system2("jar", args = c("xf", zipfile))
}
And then use:
unzipLarge("year2015.zip", exdir = "csv_folder")
I am creating an R package and would like to organize its R subdirectory with subdirectories. Since only the functions defined in R files directly under R/ are exported, I added this code to one file at the root:
sourceDir <- function(path, trace = TRUE, ...) {
  for (nm in list.files(path, pattern = "\\.[RrSsQq]$")) {
    print(nm)
    if (trace) cat(nm, ":")
    source(file.path(path, nm), ...)
    if (trace) cat("\n")
  }
}
sourceDir("R/DataGenerator")
When I use "CRTL+SHIFT+B" on RStudio, I see that the nm files are sourced. But once the package is loaded, none of the functions defined in the subdirectory R/DataGenerator are accessible, neither using :: nor using ::: .
How can I export functions defined in subdirectories of R ? Is it even possible ?
As indicated in the discussion in the comments to the accepted answer between Martin Morgan and me, this does not seem to work in current R versions. My workaround to get a bit better file organisation is to prefix the filenames with what would have been the subdirectory names.
Use the Collate: field in the DESCRIPTION file to specify paths to files to be included
Collate: foo.R bar/baz.R
A helper to generate the Collate line might be something like:
fls <- paste(dir(pattern = "\\.R$", recursive = TRUE), collapse = " ")
cat(strwrap(sprintf("Collate: %s", fls), exdent = 4), sep = "\n")
In all other cases, when I am working within an RStudio project, I can make references relative to the project root in scripts. So I can, for example, dfX = read.csv("Data/somefile.csv"), where the folder Data is relative to my project root.
The same code in a knitr chunk does not find the file. I guess this is because knitr creates a bunch of temporary directories that it needs to refer to relative to the file location. Is there an easy way to change this behavior? Obviously, I would not like to add the entire path to the project folder -- I am aware that I can easily do this using knitr::opts_knit$set(root.dir = rootPath). That completely breaks maintainability across machines and OSs.
Edit: This seems closely linked to this question.
Presumably you know the path to the package directory when you call 'knit', so how about:
ENV <- new.env()
assign("workingDirectory", getwd(), envir = ENV)
knitr::knit(...,
            # the environment in which the code chunks are to be evaluated
            envir = ENV)
Then in your rmd file you can do:
```{r}
print(workingDirectory)
```
If you're searching for the location of the current install, you can use:
PATH <- NULL
for (libPath in .libPaths()) {
  if ('myPackage' %in% list.dirs(libPath, FALSE, FALSE)) {
    PATH <- file.path(libPath, 'myPackage')
  }
}
if (is.null(PATH)) {
  stop('could not find package directory')
}
ENV <- new.env()
assign("workingDirectory", PATH, envir = ENV)
knitr::knit(...,
            # the environment in which the code chunks are to be evaluated
            envir = ENV)
My guess is that the document that you are "knitting" is in a subdirectory itself. It seems that, when you click "Knit PDF", RStudio or knitr will setwd() to the directory containing the file being knitted. So you may need to do something like dfX = read.csv("../Data/somefile.csv") to get the reference right.
I have a working example here.