Retain valid workspace reference after project transfer.

I've been working on an R project (ProjectA) that I want to hand over to a colleague. What would be the best way to handle workspace references in the scripts? To illustrate, let's say ProjectA consists of several R scripts that each read input from and write output to certain directories (dirs). All dirs are contained within my local Dropbox. The I/O part of the scripts looks as follows:
# Script 1.
# Give input and output names and dirs:
dat1Dir <- "D:/Dropbox/ProjectA/source1/"
dat1In <- "foo1.asc"
dat2Dir <- "D:/Dropbox/ProjectA/source2/"
dat2In <- "foo2.asc"
outDir <- "D:/Dropbox/ProjectA/output1/"
outName <- "fooOut1.asc"
# Read data
setwd(dat1Dir)
dat1 <- read.table(dat1In)
setwd(dat2Dir)
dat2 <- read.table(dat2In)
# do stuff with dat1 and dat2 that result in new data foo
# Write new data foo to file
setwd(outDir)
write.table(foo, outName)
# Script 2.
# Give input and output names and dirs
dat1Dir <- "D:/Dropbox/ProjectA/output1/"
dat1In <- "fooOut1.asc"
outDir <- "D:/Dropbox/ProjectA/output2/"
outName <- "fooOut2.asc"
Etc. Each script reads and writes data from/to file, and subsequent scripts read the output of previous scripts. The question is: how can I ensure that the directory strings remain valid after the transfer to another user?
Let's say we copy the ProjectA folder, including subfolders, to another PC, where it is stored at, e.g., C:/Users/foo/my documents/. Ideally, I would have a function FindDir() that finds the location of the lowest common folder in the project, here "ProjectA", so that I can replace every directory string with:
dat1Dir <- paste(FindDir(), "ProjectA/source1", sep= "")
So that:
# At my own PC
dat1Dir <- paste(FindDir(), "ProjectA/source1", sep= "")
> "D:/Dropbox/ProjectA/source1/"
# At my colleagues PC
dat1Dir <- paste(FindDir(), "ProjectA/source1", sep= "")
> "C:Users/foo/my documents/ProjectA/source1/"
Or perhaps there is a different way? Our work IT infrastructure currently does not allow using a shared disk. I'll put helper functions in an 'official' R project (i.e., hosted on R-Forge), but I'd like to use scripts when many I/O parameters are required, and because the code can easily be viewed and commented on.
Many thanks in advance!

You should be able to do this by using relative directory paths. This is what I do for my R projects that I have in Dropbox and that I edit/run on both my Windows and OS X machines, where the Dropbox folder is D:/Dropbox and /Users/robin/Dropbox, respectively.
To do this, you'll need to:
Set the current working directory in R (either in the first line of your script, or interactively at the console before running), using setwd('/Users/robin/Dropbox') (see the full docs for that command).
Change your paths to relative paths, which means they contain just the part of the path below the current directory: in this case the 'ProjectA/source1' bit if you've set your current directory to your Dropbox folder, or just 'source1' if you've set your current directory to the ProjectA folder (which is a better idea).
Then everything should just work!
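For example, the I/O section of Script 1 might then reduce to something like this (a sketch assuming the working directory has been set to the ProjectA folder; file names are taken from the question):
# Working directory is assumed to be .../ProjectA
dat1 <- read.table(file.path("source1", "foo1.asc"))
dat2 <- read.table(file.path("source2", "foo2.asc"))
# do stuff with dat1 and dat2 that results in new data foo
write.table(foo, file.path("output1", "fooOut1.asc"))
Building the relative paths with file.path() also avoids switching the working directory back and forth with setwd() for every read and write.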
You may also be interested in an R library that I love called ProjectTemplate - it gives you really nice functionality for making self-contained projects for this sort of work in R, and they're entirely reproducible, moveable between computers and so on. I've written an introductory blog post which may be useful.
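If you prefer the FindDir() idea from the question, one possible sketch (assuming the scripts are always run with the working directory somewhere inside the tree that contains the ProjectA folder) is to walk up from the working directory until the project folder is found:
# Hypothetical FindDir(): climb up from the working directory until a folder
# that contains the project directory (here "ProjectA") is found.
FindDir <- function(project = "ProjectA", start = getwd()) {
  cur <- normalizePath(start, winslash = "/")
  repeat {
    if (dir.exists(file.path(cur, project))) return(paste(cur, "/", sep = ""))
    parent <- dirname(cur)
    if (parent == cur) stop("Could not find folder: ", project)
    cur <- parent
  }
}
# dat1Dir <- paste(FindDir(), "ProjectA/source1/", sep = "")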

Related

file.copy: overcoming Windows long path/filename limitation

I have a table in R containing many files that I need copied to a destination folder. The files are spread out over dozens of folders, each several sub-folders down. I have successfully used the following code to find all of the files and their locations:
(fastq_files <- list.files(Illumina_output, ".fastq.gz", recursive = TRUE, include.dirs = TRUE) %>% as_tibble)
After appending the full path, I have a tibble that looks something like this:
full_path
Q:/IlluminaOutput/2019/091119 AB NGS/Data/Intensities/BaseCalls/19-15897-HLA-091119-AB-NGS_S14_L001_R1_001.fastq.gz
Q:/IlluminaOutput/2019/091119 AB NGS/Data/Intensities/BaseCalls/19-15236-HLA-091119-AB-NGS_S14_L001_R2_001.fastq.gz
Q:/IlluminaOutput/2018/062818AB NGS/Data/Intensities/BaseCalls/18-06875-HLA-062818-NGS_S11_L001_R1_001.fastq.gz
Using the file.copy function gives an error that the file name is too long, a known issue in Windows (I am using RStudio on Windows 10).
I found that if I set the working directory to the file location, I am able to copy files. Starting with a table like this:
file                                                  path
19-14889-HLA-091119-AB-NGS_S14_L001_R1_001.fastq.gz   Q:/IlluminaOutput/2019/091119 AB NGS/Data/Intensities/BaseCalls/
19-14889-HLA-091119-AB-NGS_S14_L001_R2_001.fastq.gz   Q:/IlluminaOutput/2019/091119 AB NGS/Data/Intensities/BaseCalls/
18-09772-HLA-062818-NGS_S11_L001_R1_001.fastq.gz      Q:/IlluminaOutput/2018/062818AB NGS/Data/Intensities/BaseCalls/
18-09772-HLA-062818-NGS_S11_L001_R2_001.fastq.gz      Q:/IlluminaOutput/2018/062818AB NGS/Data/Intensities/BaseCalls/
I used the following code to successfully copy the first file:
(dir <- as.character(as.vector(file_and_path[1,2])))
setwd(dir)
(file <- as.character(as.vector(file_and_path[1,1])))
(file.copy(file, Trusight_output) %>% as.tibble)
I got this to work, but I don't know how to apply these steps to every row in my table. I think I probably have to use the lapply function, but I'm not sure how to construct it.
This should do the trick, assuming that file_and_path$file and file_and_path$path are both character vectors and that Trusight_output is an absolute path:
f <- function(file, from, to) {
  cwd <- setwd(from)
  on.exit(setwd(cwd))
  file.copy(file, to)
}
Map(f, file = file_and_path$file, from = file_and_path$path, to = Trusight_output)
We use Map here rather than lapply because we are applying a function of more than one argument. FWIW, operations like this are often better suited for PowerShell.
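As a small usage check (still assuming the same file_and_path columns), the logicals returned by file.copy() show which copies succeeded; the name copied is just illustrative:
copied <- unlist(Map(f, file = file_and_path$file, from = file_and_path$path, to = Trusight_output))
file_and_path$file[!copied]  # any files that failed to copy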

Is there a way to create a code architecture diagram that gives an overview of R scripts that source each other?

I have a lot of different scripts in R that source one another with source(). I'm looking for a way to create an overview diagram that links each script visually, so I can easily see the "source hierarchy" of my code.
The result could look something like:
I hope there is a solution that doesn't require a software license.
Hope it makes sense! :)
I can suggest you use Knime. It has the kind of diagram you are looking for. It has some scripts already written to clean and visualize data and write output, and it has integration with R and Python.
https://docs.knime.com/?category=integrations&release=2019-12
https://www.knime.com/
Good luck.
For purposes of example change directory to an empty directory and run the code in the Note at the end to create some sample .R files.
In the first two lines of the code below we set the files variable to be a character vector containing the paths to the R files of interest. We also set st to the path to the main source file. Here it is a.R but it can be changed appropriately.
The code first inserts the line contained in variable insert at the beginning of each such file.
Then it instruments source using the trace command shown so that each time source is run a log record is produced. We then source the top level R file.
Finally we read in the log and use the igraph package to produce a tree of source files. (Any other package that can produce suitable graphics could be used instead.)
# change the next two lines of code appropriately.
# Settings shown are for the files generated in the Note at the end
# assuming they are in the current directory and no other R files are.
files <- Sys.glob("*.R")
st <- "a.R"
# inserts indicated line at top of each file unless already inserted
insert <- "this.file <- normalizePath(sys.frames()[[1]]$ofile)"
for(f in files) {
  inp <- readLines(f)
  ok <- !any(grepl(insert, inp, fixed = TRUE)) # TRUE if insert not in f
  if (ok) writeLines(c(insert, inp), f)
}
# instrument source and run to produce log file
if (file.exists("log")) file.remove("log")
this.file <- "root"
trace(source, quote(cat("parent:", basename(this.file),
"file:", file, "\n", file = "log", append = TRUE)))
source(st) # assuming a.R is the top level program
untrace(source)
# read log and display graph
DF <- read.table("log")[c(2, 4)]
library(igraph)
g <- graph.data.frame(DF)
plot(g, layout = layout_as_tree(g))
For example, if we have the files generated in the Note at the end then the code above generates this diagram:
Note
cat('
source("b.R")
source("c.R")
', file = "a.R")
cat("\n", file = "b.R")
cat("\n", file = "C.R")

How to pass several files to a command function and output them to different files?

I have a directory made of 50 files; here's an excerpt of how the files are named:
input1.txt
input2.txt
input3.txt
input4.txt
I'm writing the script in R, but I'm using bash commands inside it via system().
I have a system command X that takes one file and outputs it to one file,
for example:
X input1.txt output1.txt
I want input1.txt to output to output1.txt, input2.txt to output to output2.txt, etc.
I've been trying this:
for(i in 1:50)
{
setwd("outputdir");
create.file(paste("output",i,".txt",sep=""));
setwd("homedir");
system(paste("/usr/local/bin/command" , paste("input",i,".txt",sep=""),paste("/outputdir/output",i,".txt",sep="")));
}
What am I doing wrong? I'm getting an error at the system line; it says "incorrect string constant", and I don't get it. Did I apply the system command in the wrong manner?
Is there a way to get all the input files and output files without going through the paste command to get them inside system?
There is a pretty easy method in R to copy files to a new directory without using system commands. This also has the benefit of working across operating systems (you just have to change the file structures).
Modified code from: "Copying files with R" by Amy Whitehead.
Using your method of running over files 1:50, I have some pseudocode here. You will need to change current.folder and new.folder to match your own directories.
# identify the folders
current.folder <- "/usr/local/bin/command"
new.folder <- "/outputdir/output"
# find the files that you want
i <- 1:50
# Instead of looping we can use vector pasting to get multiple results at once!
inputfiles <- paste0(current.folder,"/input",i,".txt")
outputfiles <- paste0(new.folder,"/output",i,".txt")
# copy the files to the new folder
file.copy(inputfiles, outputfiles)
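If the goal is to actually run the external command for each pair rather than copy files, the same vector-building approach can feed system() in a loop. This is only a sketch reusing the command path and naming scheme from the question, and it assumes the script is run from the directory that holds the input files (the "homedir" in the question):
# Build all input and output file names at once, then call the command per pair.
i <- 1:50
inputs  <- paste0("input", i, ".txt")
outputs <- paste0("/outputdir/output", i, ".txt")
for (k in seq_along(inputs)) {
  # shQuote() protects against spaces or special characters in the paths
  system(paste("/usr/local/bin/command", shQuote(inputs[k]), shQuote(outputs[k])))
}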

R: Improving workflow and keeping track of output

I have what I think is a common enough issue, on optimising workflow in R. Specifically, how can I avoid the common issue of having a folder full of output (plots, RData files, csv, etc.), without, after some time, having a clue where they came from or how they were produced? In part, it surely involves trying to be intelligent about folder structure. I have been looking around, but I'm unsure of what the best strategy is. So far, I have tackled it in a rather unsophisticated (overkill) way: I created a function MetaInfo (see below) that writes a text file with metadata, with a given file name. The idea is that if a plot is produced, this command is issued to produce a text file with exactly the same file name as the plot (except, of course, the extension), with information on the system, session, packages loaded, R version, function and file the metadata function was called from, etc. The questions are:
(i) How do people approach this general problem? Are there obvious ways to avoid the issue I mentioned?
(ii) If not, does anyone have any tips on improving this function? At the moment it's perhaps clunky and not ideal. Particularly, getting the file name from which the plot is produced doesn't necessarily work (the solution I use is one provided by #hadley in 1). Any ideas would be welcome!
The function assumes git, so please ignore the probable warning produced. This is the main function, stored in a file metainfo.R:
MetaInfo <- function(message=NULL, filename)
{
  # message  - character string - Any message to be written into the information
  #            file (e.g., data used).
  # filename - character string - the name of the txt file (including relative
  #            path). Should be the same as the output file it describes (RData,
  #            csv, pdf).
  #
  if (is.null(filename))
  {
    stop('Provide an output filename - parameter filename.')
  }
  filename <- paste(filename, '.txt', sep='')
  # Try to get as close as possible to getting the file name from which the
  # function is called.
  source.file <- lapply(sys.frames(), function(x) x$ofile)
  source.file <- Filter(Negate(is.null), source.file)
  t.sf <- try(source.file <- basename(source.file[[length(source.file)]]),
              silent=TRUE)
  if (class(t.sf) == 'try-error')
  {
    source.file <- NULL
  }
  func <- deparse(sys.call(-1))
  # MetaInfo isn't always called from within another function, so func could
  # return as NULL or as general environment.
  if (any(grepl('eval', func, ignore.case=TRUE)))
  {
    func <- NULL
  }
  time <- strftime(Sys.time(), "%Y/%m/%d %H:%M:%S")
  git.h <- system('git log --pretty=format:"%h" -n 1', intern=TRUE)
  meta <- list(Message=message,
               Source=paste(source.file, ' on ', time, sep=''),
               Functions=func,
               System=Sys.info(),
               Session=sessionInfo(),
               Git.hash=git.h)
  sink(file=filename)
  print(meta)
  sink(file=NULL)
}
which can then be called in another function, stored in another file, e.g.:
source('metainfo.R')
RandomPlot <- function(x, y)
{
  fn <- 'random_plot'
  pdf(file=paste(fn, '.pdf', sep=''))
  plot(x, y)
  MetaInfo(message=NULL, filename=fn)
  dev.off()
}
x <- 1:10
y <- runif(10)
RandomPlot(x, y)
This way, a text file with the same file name as the plot is produced, with information that could hopefully help figure out how and where the plot was produced.
In terms of general R organization: I like to have a single script that recreates all work done for a project. Any project should be reproducible with a single click, including all plots or papers associated with that project.
So, to stay organized: keep a different directory for each project, each project has its own functions.R script to store non-package functions associated with that project, and each project has a master script that starts like
## myproject
source("functions.R")
source("read-data.R")
source("clean-data.R")
etc... all the way through. This should help keep everything organized, and if you get new data you just go to early scripts to fix up headers or whatever and rerun the entire project with a single click.
There is a package called ProjectTemplate that helps organize and automate the typical workflow with R scripts, data files, charts, etc. There are also a number of helpful documents, like Workflow of statistical data analysis by Oliver Kirchkamp.
If you use Emacs and ESS for your analyses, learning Org-Mode is a must. I use it to organize all my work. Here is how it integrates with R: R Source Code Blocks in Org Mode.
There is also this new free tool called Drake which is advertised as "make for data".
I think my question belies a certain level of confusion. Having looked around, as well as explored the suggestions provided so far, I have reached the conclusion that it is probably not important to know where and how a file is produced. You should in fact be able to wipe out any output, and reproduce it by rerunning code. So while I might still use the above function for extra information, it really is a question of being ruthless and indeed cleaning up folders every now and then. These ideas are more eloquently explained here. This of course does not preclude the use of Make/Drake or Project Template, which I will try to pick up on. Thanks again for the suggestions #noah and #alex!
There is also now an R package called drake (Data Frames in R for Make), independent from Factual's Drake. The R package is also a Make-like build system that links code/dependencies with output.
install.packages("drake") # It is on CRAN.
library(drake)
load_basic_example()
plot_graph(my_plan)
make(my_plan)
Like its predecessor remake, it has the added bonus that you do not have to keep track of a cumbersome pile of files. Objects generated in R are cached during make() and can be reloaded easily.
readd(summ_regression1_small) # Read objects from the cache.
loadd(small, large) # Load objects into your R session.
print(small)
But you can still work with files as single-quoted targets. (See 'report.Rmd' and 'report.md' in my_plan from the basic example.)
There is a package developed by RStudio called pins that might address this problem.
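As a rough illustration (assuming the current pins API with board_folder(), pin_write() and pin_read(); the board path and the my_results object are made up for the example):
library(pins)
# A local folder acts as a versioned "board" for outputs
board <- board_folder("~/projectA_pins", versioned = TRUE)
# Store an object along with a note on how it was produced
pin_write(board, my_results, name = "my_results",
          description = "Output of script 2, run on cleaned data")
# Later, or on another machine pointing at the same board:
my_results <- pin_read(board, "my_results")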

read .csv file with unknown path -- R

I know this might be a very stupid question, but I have been spending hours on this.
I want to read a .csv file whose full path I don't have (*/*data.csv). I know that the following would get the path of the current directory, but I don't know how to adapt it:
Marks <- read.csv(dir(path = '.', full.names=T, pattern='^data.*\\.csv'))
I tried this one as well, but it's not working:
Marks <- read.csv(file = "*/*/data.csv", sep = ",", header=FALSE)
I can't specify a particular path, as this will be used on different machines with different paths, but I am sure about the sub-folders of the main directory, as they are the result of a bash script,
and I am planning to call this from within Unix, which defines the workspace.
My data structure is:
lecture01/test/data.csv
lecture02/test/data.csv
lecture03/test/data.csv
Your comments -- though not currently your question itself -- indicate you expect to run your code in a working directory that contains some number of subdirectories (lecture01, lecture02, etc.), each of which contains a subdirectory 'marks' that in turn contains a data.csv file. If this is so, and your objective is to read the csv from within each subdirectory, then you have a couple of options depending on the remaining details.
Case 1: Specify the top-level directory names directly, if you know them all and they are potentially idiosyncratic:
dirs <- c("lecture01", "lecture02", "some_other_dir")
paths <- file.path(dirs, "marks/data.csv")
Case 2: Construct the top-level directory names, e.g. if they all start with "lecture", followed by a two-digit number, and you are able to (or specifically wish to) specify a numeric range, e.g. 01 through 15:
dirs <- sprintf("lecture%02d", 1:15)
paths <- file.path(dirs, "marks/data.csv")
Case 3: Determine the top-level directory names by matching a pattern, e.g. if you want to read data from within every directory starting with the string "lecture":
matched.names <- list.files(".", pattern="^lecture")
dirs <- matched.names[file.info(matched.names)$isdir]
paths <- file.path(dirs, "marks/data.csv")
Once you have a vector of the paths, I'd probably use lapply to read the data into a list for further processing, naming each one with the base directory name:
csv.data <- lapply(paths, read.csv)
names(csv.data) <- dirs
Alternatively, if whatever processing you do on each individual CSV is done just for its side effects, such as modifying the data and writing out a new version, and especially if you don't ever want all of them to be in memory at the same time, then use a loop.
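A sketch of that loop, assuming the same paths vector built above and an illustrative output file name:
for (p in paths) {
  dat <- read.csv(p)
  # ... modify dat as needed ...
  write.csv(dat, file.path(dirname(p), "data_clean.csv"), row.names = FALSE)  # "data_clean.csv" is just an example name
}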
If this answer misses the mark, or even if it doesn't, it would be great if you could clarify the question accordingly.
I have no code, but I would do a recursive glob from the root and a preg_match to find the .csv file (use glob brace).
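In R terms, the equivalent recursive search might look like this, matching the lecture*/test/data.csv layout and the header=FALSE call from the question:
# Find every data.csv anywhere below the working directory
paths <- list.files(".", pattern = "^data\\.csv$", recursive = TRUE, full.names = TRUE)
Marks <- lapply(paths, read.csv, header = FALSE)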
