Sorry this is long, but I'm a novice and want to be specific.
I have varying numbers of dataframes within a set of directories, within a set of directories. (That's 60 inner directories in total, hence I'm attempting to automate this.) My goal is to list and open each outer directory; within it, list and open each inner directory; and within that, perform some simple functions with the dataframes there (average some values, etc.).
The script returns "Error in setwd(inner) : cannot change working directory" and performs the function on files in the outer directory instead, and only for the first outer directory. I think the script is calling the functions in the wrong order, perhaps because I nested the for loops so that both setwd(inner) and setwd('..') sit inside setwd(outer) and setwd('..'), in order to reach every directory within every directory. It's not a recursion or path-name issue, because the same error results whether recursive and full.names are TRUE or FALSE in my list of directories (from list.dirs).
I've read about the downfalls of using setwd, but I'm the only analyst and don't need to share the script with other people/machines/OSs (I use RStudio in Mac OS 10.7.5). Are there better functions than setwd for analyzing all files in each directory in each directory? Or do I need to use a simpler script to work only within an inner directory, and apply it by hand individually to those 60 directories? Thank you for reading and thank you in advance for any advice you can offer!
I would use the list.files function that ships with base R. list.files can search a folder recursively for files, and you can pass a pattern so that it only returns files whose names match.
list.files returns the relative paths to the files you are looking for, so you can read each dataframe without having to change your working directory.
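For example, a minimal sketch along those lines (the csv pattern, read.csv, and colMeans are assumptions about your files and the "average some values" step):

# list every data file in every inner directory, relative to the top directory
files <- list.files(path = ".", pattern = "\\.csv$", recursive = TRUE)

# read each file and compute a summary, without ever calling setwd()
results <- lapply(files, function(f) {
  df <- read.csv(f)
  colMeans(df, na.rm = TRUE)  # assumes numeric columns
})
names(results) <- files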
I hope you will find this useful.
Let me know if you need any other help.
Cheers
I have already finished my RMarkdown and I'm trying to clean up the workspace a little. This isn't strictly necessary, more an organizational habit (I'm not even sure it's good practice): I want to keep the data separate from the scripts and the other R- and git-related files.
I have a bunch of .csv files for data that I used. Previously they were on (for example)
C:/Users/Documents/Project
which is what I set as my working directory. But now I want them in
C:/Users/Documents/Project/Data
The problem is that moving them breaks the following code, because the files are no longer in the working directory.
#create one big dataframe by unioning all the data
bigfile <- vroom(list.files(pattern = "*.csv"))
I've tried adding a full path to list.files() to where the csvs are but no luck.
bigfile <- vroom(list.files(path = "C:/Users/Documents/Project/Data", pattern = "*.csv"))
Error: 'data1.csv' does not exist in current working directory ('C:/Users/Documents/Project').
Is there a way to only access the /Data folder once for creating my dataframe with vroom() instead of changing the working directory multiple times?
You can list files including those in all subdirectories (Data in particular) using list.files(pattern = "\\.csv$", recursive = TRUE); the returned paths are relative to the working directory, so vroom() can read them directly.
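For example (a sketch, assuming the only .csv files under the working directory are the ones you want):

bigfile <- vroom(list.files(pattern = "\\.csv$", recursive = TRUE))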
Best practices
Have one directory of raw and only raw data (the stuff you measured).
Have another directory of external data (e.g. reference databases). This is something you can remove afterwards and redownload if required.
Have another directory for the source code.
Put only the source code directory under version control, plus one other file containing checksums of the raw and external data to prove their integrity (see the sketch after this list).
Everything else must be reproducible from the raw data and the source code, and can be removed after the project. You may want to keep small result files (e.g. tables) that take a long time to reproduce.
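As a sketch of the checksum idea (the raw/ directory name and the output file name are assumptions):

library(tools)

# checksum every file in the raw-data directory
raw_files <- list.files("raw", recursive = TRUE, full.names = TRUE)
checksums <- md5sum(raw_files)

# store them alongside the source code so integrity can be re-checked later
write.csv(data.frame(file = names(checksums), md5 = unname(checksums)),
          "checksums.csv", row.names = FALSE)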
You can list the files and capture the full file path, right?
bigfile <- vroom(list.files(path = "C:/Users/Documents/Project/Data", pattern = "\\.csv$", full.names = TRUE))
and that should read the files in the directory without reference to your wd.
Try one of these:
# list all csv files within Data within current directory
Sys.glob("Data/*.csv")
# list all csv files within immediate subdirectories of current directory
Sys.glob("*/*.csv")
If you only have csv files then the following would also work, though they seem less desirable; they might still be useful if you quickly want to review what files and directories are there. (Be very careful not to use the second one inside statements that delete files: if you are not in the directory you think you are in, you can wind up deleting files you did not intend to delete. The first one carries the same risk but is a bit safer, since it would only delete the wrong files if your current directory happens to have a Data subdirectory.)
# list all files & directories within Data within current directory
Sys.glob("Data/*")
# list all files & directories within immediate subdirectories of current directory
Sys.glob("*/*")
If the subfolder always has the same name (or the same number of characters), you should be able to do it with substring. In your example, "Data" has 4 characters (5 with the /), so the following code should do:
# drop the last 5 characters ("/Data") from the working directory path
Repository <- substring(getwd(), 1, nchar(getwd()) - 5)
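An alternative that does not depend on the folder name's length (not part of the original suggestion) is dirname(), which returns the parent of any path:

Repository <- dirname(getwd())  # "a/b/c/Data" becomes "a/b/c"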
I am trying to get started with writing my first R code. I have searched for this answer but I am not quite sure what I've found is what I'm looking for exactly. I know how to get R to read in multiple files in the same subdirectory, but I'm not quite sure how to get it to read in one specific file from multiple subdirectories.
For instance, I have a main directory containing a series of trajectory replicates, each replicate in its own subdirectory. The breakdown is as follows:
"Main Dir" -> "SubDir1" -> "ReplicateDirs 1-6"
From each "ReplicateDir" I want R to pull the "RMSD.dat" table (file) to read from. All of the RMSD.dat files have identical names, they are just in different directories and contain different data of course.
I could move all the files to one folder but this doesn't seem like the most efficient way to attack this problem.
If anyone could enlighten me, I'd appreciate it.
Thanks
This should work; of course, change Main Dir to your directory:
dat.files <- list.files(path = "Main Dir",
                        recursive = TRUE,
                        pattern = "RMSD\\.dat$",
                        full.names = TRUE)
If you want to read the files into a list of data frames, you could use the function below:
readDatFile <- function(f) {
  read.csv(f)  # you may have to change read.csv to match your data type
}
And apply it to the list of files (lapply keeps the result as a list, one data frame per file):
data.files <- lapply(dat.files, readDatFile)
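If you then want a single combined data frame (a hypothetical follow-up, assuming all replicates share the same columns):

names(data.files) <- dat.files          # label each replicate by its path
combined <- do.call(rbind, data.files)  # stack them into one data frame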
Situation
I wrote an R program which I split up into multiple R-files for the sake of keeping a good code structure.
There is a Main.R file which references all the other R-files with the 'source()' command, like this:
source(paste(getwd(), dirname1, 'otherfile1.R', sep="/"))
source(paste(getwd(), dirname3, 'otherfile2.R', sep="/"))
...
As you can see, the working directory needs to be set correctly in advance, otherwise, this could go wrong.
Now, if I want to share this R program with someone else, I have to pass all the R files and folders in relative order of each other for things to work. Hence my next question.
Question
Is there a way to replace all the 'source' commands with the actual R script code which it refers to? That way, I have a SINGLE R script file, which I can simply pass along without having to worry about setting the working directory.
I'm not looking for a solution that is an 'R package' (which, by the way, is one single directory, so I would lose my own directory structure). I'm simply wondering if there is an easy way to combine these self-referencing R files into one single file.
Thanks,
OK, I think you could do something like scanning all the files and then writing them into one new file. This can be done using readLines and sink:
sink("mynewRfile.R")
for(i in Nfiles){
current_file = readLines(filedir[i])
cat("\n\n#### Current file:",filedir[i],"\n\n")
cat(current_file, sep ="\n")
}
sink()
Here I have assumed all your file paths are in a vector filedir of length Nfiles; I guess you can adapt that.
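For example, reusing the directory names from your question (these paths are assumptions):

filedir <- c(file.path("dirname1", "otherfile1.R"),
             file.path("dirname3", "otherfile2.R"))
Nfiles <- length(filedir)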
I am working with R in several directories containing model output I'd like to analyse and plot. I maintain a single 'scripts' directory for this project.
I'd like to be able to 'point' an environment variable at this scripts directory so that I could tab complete source(...) commands. Is this a possibility?
So far, I've managed to create an RPATH environment variable, and have written a function in my .Rprofile which lists the directory's contents without me having to type it out. I can't quite figure how I'd get tab completion though.
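For reference, a minimal version of the kind of .Rprofile helper described might look like this (the function names are made up, and this does not by itself give tab completion):

# list the scripts directory pointed to by the RPATH environment variable
ls_scripts <- function() list.files(Sys.getenv("RPATH"))

# source a script from that directory by name
src <- function(f) source(file.path(Sys.getenv("RPATH"), f))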
Any help/advice would be greatly appreciated.
I wrote a number of functions and scripts and put them in subfolders of the working directory, so I can group my functions by topic (descriptive statistics, geostatistics, regression, ...).
When I type source("function_in_subfolder"), R tells me that there is no such file.
I understood that this happens because the files have to be in the working directory.
Is there a way to set also subfolders of the working directory as source for the functions (let's say in a hierarchical way)?
The source function has a chdir argument which, if set to TRUE, sets the working directory to the one where the script resides. The new working directory is valid for the duration of the script's execution; after that, it is changed back. Assuming the following structure
main.R
one/
    script.R
    two/
        subscript.R
you can call source("one/script.R", chdir = TRUE) from main.R and, in script.R, call source("two/subscript.R", chdir = TRUE).
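Concretely, the two calls would look like this (a sketch of the structure above):

# in main.R:
source("one/script.R", chdir = TRUE)

# in one/script.R -- while it runs, the working directory is one/:
source("two/subscript.R", chdir = TRUE)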
However, by default, R starts its search from the current directory. There is no such thing as a "list of search paths" like the PATH environment variable, although apparently someone has attempted to create such a thing. I would strongly advise against trying to find a script file "anywhere". Instead, indicate precisely which script is to be run at which point; otherwise, name clashes resulting from simply adding a file to your scripts can lead to unpredictable behavior that is also difficult to debug.
One solution is to use list.files to get the full path of your function. For example:
myfunction.path <- list.files(getwd(),
                              recursive = TRUE, full.names = TRUE,
                              pattern = '^myfunction\\.R$')
Then you can call it:
source(myfunction.path)
The recursive call to list.files can be expensive, so you may want to call it once at the beginning of your analysis and store all function paths in a named list. And BE CAREFUL: the result may not be unique if you create two source files with the same name in two different subdirectories.
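A sketch of that caching idea (the pattern and lookup name are assumptions):

# scan once, then look functions up by file name; assumes names are unique
fun_paths <- list.files(getwd(), recursive = TRUE, full.names = TRUE,
                        pattern = '\\.R$')
names(fun_paths) <- basename(fun_paths)
source(fun_paths[["myfunction.R"]])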