I need to write a piece of code in R that will create a list specifying:
Names of subfolders with a pre-set depth (e.g. 2 levels down)
Path
Date modified
I've tried to use the following generic functions but had no luck:
list.files(path, pattern=NULL, all.files=FALSE,
full.names=FALSE)
dir(path, pattern=NULL, all.files=FALSE,
full.names=FALSE)
Would very much appreciate your response.
I think what you are missing is the recursive = TRUE argument in list.files().
One possible solution could be to list all files first and then limit the output to 2 levels accordingly.
files <- list.files(path = "D:/cmder/", recursive = TRUE)
Since R represents paths using "/", a simple approach is to remove every entry that contains three or more slashes if you need a depth of 2.
files[!grepl(".*/.*/.*/.*", files)]
Be careful on Windows, as you might sometimes see a backslash "\" there, but only if your path information comes from something other than R itself, e.g. a csv import.
My grepl() statement can probably be improved as I'm not an expert there.
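If you also need the modification date asked for in the question, a minimal sketch building on the same slash-counting idea could combine list.dirs() with file.info(); the folder "D:/cmder" is only the example used above:
path <- "D:/cmder"                                  # example top-level folder
rel  <- list.dirs(path, full.names = FALSE, recursive = TRUE)
# "" is the top folder itself, one "/" means 2 levels down, two or more means deeper
rel <- rel[rel != "" & !grepl(".*/.*/.*", rel)]
info <- data.frame(name     = basename(rel),
                   path     = file.path(path, rel),
                   modified = file.info(file.path(path, rel))$mtime)
This gives one row per sub-folder down to depth 2, with its name, full path and modification time.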
I have a directory in Windows with multiple sub-directories. I also have a list of file names that I would like to search for within the directory structure (to retrieve the exact path).
This works fine for a single value, as below (current.folder is a variable for the main directory):
files <- list.files(current.folder,
                    pattern = test,
                    recursive = TRUE,
                    full.names = TRUE)
I can then use the returned path to do file.copy.
The problem I am now having is applying that function to multiple file names stored in a dataframe or even a vector.
I've tried referencing them in the pattern argument (which only returns matches for the first value) and used a for loop on a set of just two file names (which returns blank).
Am I using the wrong technique or just not finding the correct setup?
Edit for clarification: test refers to the value "1-5FX3C7P_1-5FX3C8T_JNJLFSPROD-ZZFDA-CDRH-AERS-3500A-01162017131543-1-5FX3C7P.xml"
Does this work:
sapply(test, function(x) list.files(current.folder,
                                    pattern = x,
                                    recursive = TRUE,
                                    full.names = TRUE))
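If it does, the result can go straight into file.copy() for the copying step mentioned in the question; a small sketch, where dest.folder is a hypothetical destination directory:
found <- sapply(test, function(x) list.files(current.folder, pattern = x,
                                             recursive = TRUE, full.names = TRUE))
dest.folder <- "C:/target"                      # hypothetical destination folder
file.copy(from = unlist(found), to = dest.folder)
Note that list.files() treats pattern as a regular expression, so file names containing characters such as "." will still match, just less strictly than an exact comparison.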
I have many txt files that I want to import into R. These files are imported one by one, I do the operations that I want, and then I import the next file.
All these files are located in a database system where all the folders have almost the same names, e.g.
database\type4\system50
database\type6\system50
database\type4\system30
database\type4\system50
Similarly, the names of the files are also almost the same, referring to the folder where they are positioned, e.g.
type4.system50.txt
type6.system50.txt
type4.system30.txt
type4.system50.txt
I have heard that there should be an easier way of importing these many files one by one than simply repeating setwd and read.csv2 commands. As far as I understand, this is possible with the macro import function in SAS, where you specify an overall path and then, each time you want to import a file, you specify what is specific about this file name/folder name.
Is there a similar function in R? I tried looking at Importing Data in R like SAS macro, but that question did not really show me how to specify the folder name/file name.
Thank you for your help.
If you want to specify the folder name / file name, try this:
databasepath <- "path/to/database"
## list all files under the database folder
tmp <- list.files(databasepath, recursive = TRUE, full.names = TRUE, include.dirs = TRUE)
## filter the files you want to read
readmyfile <- function(foldername, filename){
  tmp[which(grepl(foldername, tmp) & grepl(filename, tmp))]
}
files_to_read <- readmyfile("type4", "system50")
some_files <- lapply(files_to_read, read.csv2)
## Or you can read all of them (if memory is large enough to hold them)
all_files <- lapply(tmp, read.csv2)
Instead of using setwd continuously, you could specify the absolute path for each file, save all of the paths to a vector, loop through the vector of paths, and load the files into a list:
library(data.table)
file_dir <- "path/to/files/"
file_vec <- list.files(path = file_dir, pattern = "\\.txt$")
file_list <- list()
for (n in seq_along(file_vec)){
  file_list[[n]] <- fread(input = paste0(file_dir, file_vec[n]))
}
I'm writing a loop script which involves reading a file from a workbook (using the package XLConnect). The challenge is that the file names contain characters (representing time) that I want to ignore.
For example, here are 3 paths to those files:
G://User//Documents//daily_data//Op_Schedule_20160520_132025.xlsx
G://User//Documents//daily_data//Op_Schedule_20160521_142805.xlsx
G://User//Documents//daily_data//Op_Schedule_20160522_103052.xlsx
I need to import hundreds of those files. I can easily account for the character string representing the date (e.g. 20160522), but not the time.
Is there a way to tell R to ignore some characters located in the file path? Here is how I was thinking of writing my script (the "???" is where I need help). I know a loop is probably not the most efficient way, but I'm open to suggestions, should you have any:
require(XLConnect)
path= "G://User//Documents//daily_data//Op_Schedule_"
wd.seq = format(seq(as.Date("2014-01-01"),as.Date("2016-12-31"),"days"),format="%Y%m%d")
scheduleList = rep(list(matrix(1,1,1)),length(wd.seq))
for(i in 1:length(wd.seq)) {
wb = loadWorkbook(file= paste0(path,wd.seq[i],"???",".xlsx"))
scheduleList[[i]] = readWorksheet(wb,sheet='=SCHEDULE', header = TRUE)
}
Thanks for reading and suggestions, if any.
Mathieu
I don't know if this is helpful, but if you want to read all the files in a certain directory (which it seems to me is what you're after), you can read all the file names into a character vector using the list.files() function, for example:
fileList <- list.files("G://User//Documents//daily_data//")
And then load the xlsx files looping through the list with a for loop
for(i in fileList) {
  wb <- loadWorkbook(file = i)
  # ... then readWorksheet(wb, ...) as needed
}
I haven't used the XLConnect functions before, so that exact code probably doesn't work, but the loop will iterate through all the files in that directory, so you can construct your loading call using the i variable for the file name (it won't be an absolute path, though, so you might need to use paste to add the first part of the file path).
I realize there might be other files in the directory that are not Excel files; you could use grepl to select only files containing "Op_Schedule_":
fileListClean <- fileList[grepl("Op_Schedule_",fileList)]
or perhaps only selecting .xlsx files in the directory:
fileListClean <- fileList[grepl(".xlsx",fileList)]
Edit to fit your reply:
Since you need to fit it to a sequence, you can do it as you did earlier:
wd.seq = format(seq(as.Date("2014-01-01"),as.Date("2016-12-31"),"days"),format="%Y%m%d")
wd.seq2 <- paste("Op_Schedule_", wd.seq, sep = "")
And then use grepl to only pick files starting with those prefixes:
fileListClean <- fileList[grepl(paste(wd.seq2, collapse = "|"), fileList)]
Full disclosure: the last part I got from this SO answer: grep using a character vector with multiple patterns
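Putting the pieces together, a minimal sketch of the full loop (the sheet name "SCHEDULE" is only a guess based on your script, and the folder path is the one from your question; adjust both as needed):
library(XLConnect)
path   <- "G://User//Documents//daily_data//"
wd.seq <- format(seq(as.Date("2014-01-01"), as.Date("2016-12-31"), "days"), format = "%Y%m%d")
fileList      <- list.files(path)
fileListClean <- fileList[grepl(paste(paste0("Op_Schedule_", wd.seq), collapse = "|"), fileList)]
# read each matching workbook, prepending the folder to get a full path
scheduleList <- lapply(fileListClean, function(f) {
  wb <- loadWorkbook(paste0(path, f))
  readWorksheet(wb, sheet = "SCHEDULE", header = TRUE)
})
This way the time portion of each file name never has to be specified at all.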
I have 100 text files with matrices which I want to open using R - the read.table() command can be used for that.
I can't figure out how to assign these files to separate variable names so that I can carry out operations on the matrices.
I am trying to use the for loop but keep getting error messages.
I hope somebody can help me out with this...
If you have 100 files, it may make more sense to simply keep them in one neat list.
# Get the list of files
#----------------------------#
folder <- "path/to/files"
fileList <- dir(folder, recursive=TRUE) # grep through these, if you are not loading them all
# use platform appropriate separator
files <- paste(folder, fileList, sep=.Platform$file.sep)
# Read them in
#----------------------------#
myMatrices <- lapply(files, read.table)
Then access them via, e.g., myMatrices[[37]], or using lapply.
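For instance, a quick sketch of working with the whole list at once (dim() here is just a stand-in for whatever operation you actually need):
names(myMatrices) <- fileList           # name each element after its source file
all.dims <- lapply(myMatrices, dim)     # e.g. check every matrix's dimensions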
Would it be easier to just use list.files?
For example:
files <- list.files("directory/path", pattern = "regexp.if.needed")
And then you could access each element by calling files[1], files[2], etc. This would allow you to pull out either all the files in a directory, or just the ones that matched a regular expression.
I know this might be a very stupid question but I have been spending hours on this.
I want to read a .csv file for which I don't have the full path (*/*data.csv). I know the following would get the path of the current directory, but I don't know how to adapt it:
Marks <- read.csv(dir(path = '.', full.names=T, pattern='^data.*\\.csv'))
I tried this one as well, but it's not working:
Marks <- read.csv(file = "*/*/data.csv", sep = ",", header = FALSE)
I can't specify a fixed path, as this will be used on different machines with different paths, but I am sure about the sub-folders of the main directory, as they are the result of a bash script,
and I am planning to call this from within Unix, which defines the workspace.
My data structure is:
lecture01/test/data.csv
lecture02/test/data.csv
lecture03/test/data.csv
Your comments -- though not currently your question itself -- indicate you expect to run your code in a working directory that contains some number of subdirectories (lecture01, lecture02, etc.), each of which contains a subdirectory 'marks' that in turn contains a data.csv file. If this is so, and your objective is to read the csv from within each subdirectory, then you have a couple of options, depending on the remaining details.
Case 1: Specify the top-level directory names directly, if you know them all and they are potentially idiosyncratic:
dirs <- c("lecture01", "lecture02", "some_other_dir")
paths <- file.path(dirs, "marks/data.csv")
Case 2: Construct the top-level directory names, e.g. if they all start with "lecture", followed by a two digit number, and you are able to (or specifically wish to) specify a numeric range, e.g. 01 though 15:
dirs <- sprintf("lecture%02d", 1:15)
paths <- file.path(dirs, "marks/data.csv")
Case 3: Determine the top-level directory names by matching a pattern, e.g. if you want to read data from within every directory starting with the string "lecture":
matched.names <- list.files(".", pattern="^lecture")
dirs <- matched.names[file.info(matched.names)$isdir]
paths <- file.path(dirs, "marks/data.csv")
Once you have a vector of the paths, I'd probably use lapply to read the data into a list for further processing, naming each one with the base directory name:
csv.data <- lapply(paths, read.csv)
names(csv.data) <- dirs
Alternatively, if whatever processing you do on each individual CSV is done just for its side effects, such as modifying the data and writing out a new version, and especially if you don't ever want all of them to be in memory at the same time, then use a loop.
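A minimal sketch of that loop, where the read/modify/write steps are only placeholders for whatever processing you actually do (overwriting the file in place is just one possibility):
for (p in paths) {
  dat <- read.csv(p)
  ## ... modify dat here ...
  write.csv(dat, p, row.names = FALSE)   # write the updated version back out
}
Only one file is held in memory at a time.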
If this answer misses the mark, or even if it doesn't, it would be great if you could clarify the question accordingly.
I have no code, but I would do a recursive glob from the root and a preg_match-style pattern match to find the .csv files (use glob brace).
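In R terms, the same idea can be expressed with Sys.glob(), which expands wildcard paths relative to the working directory that the bash script sets; a small sketch, assuming the lectureXX/test/data.csv layout from the question:
paths <- Sys.glob("lecture*/test/data.csv")
marks <- lapply(paths, read.csv, sep = ",", header = FALSE)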