Using index number of file in directory - r

I'm using the list.files function in R. I know how to tell it to access all files in a directory, such as:
list.files("directory", full.names=TRUE)
But I don't really know how to subset the directory. If I just want list.files to list the 2nd, 5th, and 6th files in the directory, is there a way to tell list.files to only list those files? I've been thinking about whether it's possible to use the files' indices within the directory but I can't figure out how to do it. It's okay if I can only do this with consecutive files (such as 1:3) but non-consecutive would be even better.
The context of the question is that this is for a problem for a class, so I'm not worried about the files in the directory changing or being deleted.

If you store the result of list.files in an object, say object, you can see that it is just an atomic vector of class character (nothing more, nothing less!). You can subset it with regex syntax for character strings (via functions that use regex, such as grep or grepl), with the regular subsetting operator [, or (most importantly) by combining both techniques.
For your example:
object[c(2,5,6)]
or exclude with:
object[-c(2,5,6)]
or, if you want to find all names that start with the string shuttle:
object[grepl("^shuttle", object)]
or, if you want to find all .csv files (escaping the dot so it matches a literal period):
object[grepl("\\.csv$", object)]
The possibilities are endless.
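Putting it together, a minimal sketch (assuming the directory name "directory" from the question):
object <- list.files("directory", full.names = TRUE)  # store the listing once
object[c(2, 5, 6)]   # non-consecutive files by index
object[1:3]          # consecutive files by index
object[grepl("\\.csv$", basename(object))]  # only .csv files; basename() because full.names = TRUE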

Related

regex for a string that does NOT contain a certain phrase not working in `list.files`

I want to remove a certain element from a list of files/folders that I've read in with list.files.
That is, the result of my list.files call contains several files and one folder that ends with '_files', e.g. c('test.html', 'test_files').
I want to remove the folder from that list.
It works when I have the list of files/folders and then do something like stringr::str_subset(c('test.html', 'test_files'), '^(?!.*_files).*$').
However, if I specify the above regex as the pattern argument of list.files, I get an error that the pattern argument is invalid.
What am I missing, and how can I directly return a list of files that does not contain elements ending with '_files'?
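One likely explanation (my note; the original thread's answer is not reproduced here): the lookahead (?!...) is Perl-style regex, and list.files applies its pattern with the default regex engine and has no perl argument to switch engines, so the pattern is rejected as invalid. A simple workaround is to list everything and filter afterwards:
out <- list.files("some_dir")   # "some_dir" is a placeholder path
out[!grepl("_files$", out)]     # drop entries ending in '_files'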

R - Use Glob Pattern to Extract text files from multiple directories

I'm trying to work out a way of extracting text files from multiple directories using fs::dir_ls and vroom.
The directory structure is essentially M:/instrument/project/experiment/measurements/time_stamp/raw_data/files*.txt.
Ideally, I want to be able to define the path down to the experiment level and then let the pattern take care of the rest, for example:
fs::dir_ls(path = "M:/instrument/project/", glob = "experiment_*/2021-04-11*/raw_data/files*.txt", invert = TRUE, recurse = TRUE)
so that I'm reading in all the .txt files across multiple experiment directories in one go. However, when I try this approach, it returns all the files from the project level rather than those from the specific folders described by the pattern.
I've looked through the other SO questions on the topic covered here: Pattern matching using a wildcard, R list files with multiple conditions, list.files pattern argument in R, extended regular expression use, and grep using a character vector with multiple patterns, but haven't been able to apply them to my particular problem.
Any help is appreciated, I realise the answer is likely staring me in the face, I just need help seeing it.
Thanks
You can try the following with list.files (note that pattern is a regular expression, not a glob, so .* rather than * is needed):
files <- list.files('M:/Operetta/LED_Wound/operetta_export/plate_variability[540]/robot_seed_wide_plate_1[1614]/2021-05-10T113438+0100[1764]/SC_data', pattern = 'arpe19.*\\.txt$')
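Alternatively, staying with fs: dir_ls matches glob and regexp against the full path, so a sketch along the lines of the question's layout (paths are the question's, untested) is to recurse from the project level and filter with regexp instead of glob:
fs::dir_ls(path = "M:/instrument/project/", recurse = TRUE, regexp = "raw_data/files.*\\.txt$")
Note that invert = TRUE in the original call excludes the matches, which may be part of the problem, so leave it at its default.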

Using R to read all files in a specific format and with specific extension

I want to read all the files in .xlsx format whose names start with the string "csmom". I have used the list.files function, but I do not know how to combine the two patterns. Please see the code: I want all the files starting with the csmom string, and they should all be in .xlsx format.
master1<-list.files(path = "C:/Users/Admin/Documents/csmomentum funal",pattern="^csmom")
master2<-list.files(path = "C:/Users/Admin/Documents/csmomentum funal",pattern="\\.xlsx$")
@jay.sf's solution works for creating a regular expression that pulls out exactly the files you want.
However, generally speaking, if you want to cross two lists to find the subset of elements contained in both (in your case, the files that satisfy both conditions), you can use intersect().
intersect(master1, master2)
This will show you all the files that satisfy pattern 1 and pattern 2.
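For completeness, a combined single pattern (a sketch of the kind of regex referred to above, using the question's path) would be:
list.files(path = "C:/Users/Admin/Documents/csmomentum funal", pattern = "^csmom.*\\.xlsx$")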

Sorting files and reading according to basename in R

This question may have been asked before, but I could not work out how to do it. I have some files stored in different folders, say folder1, folder2, and folder3. I want to sort these files according to their basenames and create a list of data frames. This is what I have done.
files1<-list.files("a/b/c/folder1/",pattern="\\.csv$",full.names=T)
files2<-list.files("a/b/c/folder2/",pattern="\\.csv$",full.names=T)
files3<-list.files("a/b/c/folder3/",pattern="\\.csv$",full.names=T)
# Create a list to sort the files
files<-c(files1,files2,files3)
newlist<-sort(basename(files))
This gives the list of files sorted by their basenames: a01.csv, b02.csv, etc.
I then try to read the sorted list of files, but now I don't have the path names, so I am not able to read them.
readfiles<-lapply(newlist,function(x){read.csv(x,sep=",",stringsAsFactors=F,header=T)})
Is there any way to read this sorted list of files?
Just use order to get an ordered vector of indices to rearrange the original vector of files:
files <- c("path/b01.csv","path/a01.csv", "path/a02.csv")
files[order(basename(files))]
[1] "path/a01.csv" "path/a02.csv" "path/b01.csv"

Using R to list all files with a specified extension

I'm very new to R and am working on updating an R script to iterate through a series of .dbf tables created using ArcGIS and produce a series of graphs.
I have a directory, C:\Scratch, that will contain all of my .dbf files. However, when ArcGIS creates these tables, it also includes a .dbf.xml file. I want to remove these .dbf.xml files from my file list and thus from my iteration. I've tried searching and experimenting with regular expressions to no avail. This is the basic expression I'm using (excluding all of the various experimentation):
files <- list.files(pattern = "dbf")
Can anyone give me some direction?
files <- list.files(pattern = "\\.dbf$")
The $ at the end anchors the match to the end of the string. "dbf$" will work too, but adding \\. (. is a special character in regular expressions, so you need to escape it) ensures that you match only files with the extension .dbf (in case you have e.g. .adbf files).
Try this, which uses globs rather than regular expressions, so it will only pick out the file names that end in .dbf:
filenames <- Sys.glob("*.dbf")
Peg the pattern to find "\\.dbf" at the end of the string using the $ character:
list.files(pattern = "\\.dbf$")
This gives you the list of files with full paths:
Sys.glob(file.path(file_dir, "*.dbf")) ## file_dir = the directory containing the files
I am not very good at using sophisticated regular expressions, so I'd do such a task in the following way:
files <- list.files()
dbf.files <- files[-grep(".xml", files, fixed = TRUE)]
The first line just lists all files in the working directory. The second one drops everything containing ".xml" (grep returns the indices of such strings in the files vector; subsetting with negative indices removes the corresponding entries).
The fixed argument for grep is just my whim, as I usually want it to perform crude pattern matching without Perl-style fancy regexes, which can catch me by surprise.
I'm aware that such a solution simply reflects drawbacks in my education, but for a novice it may be useful =) at least it's easy.
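One caveat (my addition): negative grep indexing misbehaves when there are no matches, since files[-integer(0)] selects nothing rather than everything. A logical-index variant is safe either way:
dbf.files <- files[!grepl(".xml", files, fixed = TRUE)]  # safe even with zero matches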
