R: Deleting Files in R based on their Names - r

I am working with the R programming language.
I found the following related question on Stack Overflow ("How to delete a file with R?"), which shows how to delete a file with a specific name from the working directory:
#Define the file name that will be deleted
fn <- "foo.txt"
#Check its existence
if (file.exists(fn)) {
#Delete file if it exists
file.remove(fn)
}
[1] TRUE
My Question: Is it possible to delete files based on whether the file name contains a specific combination of letters (i.e. LIKE 'fo%' )? This way, all files in the working directory starting with the letters "fo" will be deleted.
What I tried so far:
I thought of a way where I could first create a list of all files in the working directory that I want to delete based on their names:
# create list of all files in working directory
a = getwd()
path.to.csv <- a
files<-list.files(path.to.csv)
my_list = print(files) ## list all files in path
#identify files that match the condition
to_be_deleted = my_list[grepl("fo",unlist(my_list))]
Then, I tried to delete these files using the same command as before:
if (file.exists(to_be_deleted)) {
#Delete file if it exists
file.remove(to_be_deleted)
}
This returned the following message:
[1] TRUE TRUE TRUE TRUE TRUE TRUE
Warning message:
In if (file.exists(to_be_deleted)) { :
the condition has length > 1 and only the first element will be used
Does anyone know if I have done this correctly? Suppose there were multiple files in the working directory whose names started with "fo": would all of them have been deleted, or only the first file in the list?
Can someone please show me how to do this correctly?
Thanks!

file.remove accepts a character vector of files to delete.
file.exists also accepts a vector, but it returns a logical vector with one value per file. That does not work with if, which requires a single logical value.
However, you don't need to check the existence of files returned by list.files: they obviously exist.
So the simplest fix is to drop the if test and just call file.remove:
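# list.files() returns bare names unless full.names = TRUE
files <- list.files(path, full.names = TRUE)
# keep the files whose name starts with "fo"
to_be_deleted <- files[grepl("^fo", basename(files))]
file.remove(to_be_deleted)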
Or even simpler:
to_be_deleted <- list.files(path, pattern = "^fo", full.names = TRUE)
file.remove(to_be_deleted)
A few notes, however:
You don't know in advance whether you have permission to delete these files.
You also don't know whether the names refer to files or to directories (or something else). It's tempting to believe that file.exists answers the second question, that is, that it tells you a name is a real file, but it does not: file.exists(path) also returns TRUE when path is a directory. However, you can detect directories with dir.exists(path). Depending on your specific case, it may or may not be necessary to check for this (for instance, if you know the pattern you match against only ever selects regular files, it's fine).
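As a rough sketch of the checks described above, one might drop directories before deleting and then inspect what file.remove returns. The scratch directory and all file names here are made up for illustration; nothing outside tempdir() is touched.

```r
# Work in a throwaway directory so nothing real is deleted (assumed setup).
path <- file.path(tempdir(), "fo_demo")
dir.create(path, showWarnings = FALSE)
file.create(file.path(path, c("foo.txt", "fob.csv", "bar.txt")))
dir.create(file.path(path, "folder"), showWarnings = FALSE)  # also starts with "fo"

# Names starting with "fo"; full.names = TRUE so the paths work from any working directory.
candidates <- list.files(path, pattern = "^fo", full.names = TRUE)

# Keep regular files only: file.exists() would be TRUE for the directory too.
to_be_deleted <- candidates[!dir.exists(candidates)]

# file.remove() returns one logical per file; FALSE marks a failed deletion.
removed <- file.remove(to_be_deleted)
remaining <- sort(list.files(path))
```

Checking all(removed) afterwards is one simple way to notice a permission problem, since a file that could not be deleted shows up as FALSE.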

Related

R: locating files that their names contain a specific string from a directory and match to my list of wanted files

It's me, the newbie, again with another messy file and folder situation (thanks to us biologists): I have a directory containing a huge number of .txt files (~900,000+), and all the files were previously handled with an inconsistent naming format :(
For example, messy files in directory look like these:
ctrl_S978765_uns_dummy_00_none.txt
ctrl_S978765_3S_Cookie_00_none.txt
S59607_3S_goody_3M_V10.txt
ctrlnuc30-100_S3245678_DMSO_00_none.txt
ctrlRAP_S0846567_3S_Dex_none.txt
S6498432_2S_Fulra_30mM_V100.txt
.....
As you can see, the naming has no reliable consistency. What's important for me is the ID code embedded in each name, such as S978765. I now have a list of 100 of these ID codes that I want.
The CSV file containing the list looks like the one below. Note that the list contains repeated ID codes, because the same ID can appear with different CLnumber values in the second column:
ID code CLnumber
S978765 1
S978765 2
S306223 1
S897458 1
S514486 2
....
So I want to achieve the following: find all the messily named files whose ID codes match my list, and copy them into a new directory.
I thought of using list.files() to get all the .txt files and their names, but then I got stuck on the next step, matching the ID codes. I know how to do it with one string, say "S978765", but doing it one by one is almost like digging through the folder manually.
How can I feed in the ID codes from column 1 as a list, match them against the messy file names in the directory, and then copy the matching files into a new folder?
Many thanks,
ML
This works:
library(stringr)
# get this via list.files in your actual code
files <- c("ctrl_S978765_uns_dummy_00_none.txt",
"ctrl_S978765_3S_Cookie_00_none.txt",
"S59607_3S_goody_3M_V10.txt",
"ctrlnuc30-100_S3245678_DMSO_00_none.txt",
"ctrlRAP_S0846567_3S_Dex_none.txt",
"S6498432_2S_Fulra_30mM_V100.txt")
ids <- data.frame(`ID Code` = c("S978765", "S978765", "S306223", "S897458", "S514486"),
CLnumber = c(1, 2, 1, 1, 2),
stringsAsFactors = FALSE)
str_subset(files, paste(ids$ID.Code, collapse = "|"))
#> [1] "ctrl_S978765_uns_dummy_00_none.txt" "ctrl_S978765_3S_Cookie_00_none.txt"
str_subset takes a character vector and returns elements matching some pattern. In this case, the pattern is "S978765|S978765|S306223|S897458|S514486" (created by using paste), which is a regular expression that matches any of the ID codes separated by |. So we take files and keep only the elements that have a match in ID Code.
There are many other ways to do this, which may or may not be more clear. For example, you could pass ids$ID.Code directly to str_subset instead of constructing a regular expression via paste, but that would throw a warning about object lengths every time, which could get confusing (or cause problems if you get used to ignoring it and then ignore it in a different context where it matters). Another method would be to use purrr and keep, but while that might be a little bit more clear to write, it would be a lot more inefficient since it would mean making multiple passes over the files vector -- not relevant in this context, but possibly very relevant if you suddenly need to do this for hundreds of thousands of files and IDs.
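For comparison, here is a minimal base R sketch of the same idea, assuming the ID codes are available as a plain character vector (the name id_codes is made up for this example). It removes duplicate IDs before building the pattern, which keeps the regular expression shorter.

```r
files <- c("ctrl_S978765_uns_dummy_00_none.txt",
           "S59607_3S_goody_3M_V10.txt",
           "S6498432_2S_Fulra_30mM_V100.txt")
id_codes <- c("S978765", "S978765", "S306223")

# Duplicated IDs add nothing to the pattern, so drop them first.
pattern <- paste(unique(id_codes), collapse = "|")

# grepl() makes a single pass over files, like str_subset().
matched <- files[grepl(pattern, files)]
```

Note that this is substring matching: an ID such as S59607 would also match a longer code that begins with the same characters, so it may be worth checking the results if the IDs vary in length.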
You could use a regular expression to extract the ID codes from the file names.
Here, I have used the pattern "S" followed by 5 or more digits. Once we have extracted the ID codes, we can compare them with the ones in the CSV.
Assuming the CSV has been read into a data frame called df and the column is named ID_Codes, we can use %in% to filter them.
We can then use file.copy to copy the files from one folder to another.
all_files <- list.files(path = '/Path/To/Folder', full.names = TRUE)
selected_files <- all_files[sub('.*(S\\d{5,}).*', '\\1', basename(all_files))
%in% unique(df$ID_Codes)]
file.copy(selected_files, 'new_path/for/files')
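The whole extract, match, and copy sequence can be tried end to end on scratch data. The sketch below is only an illustration: the folder names, the stand-in data frame df, and its ID_Codes column are assumptions, and it swaps sub() for regexpr()/regmatches() so that names without an ID code yield NA instead of the unchanged file name.

```r
# Throwaway source and destination folders (assumed names).
src <- file.path(tempdir(), "messy_src")
dest <- file.path(tempdir(), "messy_dest")
dir.create(src, showWarnings = FALSE)
dir.create(dest, showWarnings = FALSE)
file.create(file.path(src, c("ctrl_S978765_uns_dummy_00_none.txt",
                             "S59607_3S_goody_3M_V10.txt",
                             "ctrlRAP_S0846567_3S_Dex_none.txt")))

# Stand-in for the data frame read from the CSV; ID_Codes is an assumed column name.
df <- data.frame(ID_Codes = c("S978765", "S978765", "S306223"),
                 CLnumber = c(1, 2, 1))

all_files <- list.files(src, pattern = "\\.txt$", full.names = TRUE)

# Extract the first "S" followed by 5 or more digits from each name (NA when absent).
m <- regexpr("S[0-9]{5,}", basename(all_files))
codes <- rep(NA_character_, length(all_files))
codes[m > 0] <- regmatches(basename(all_files), m)

# Keep files whose code is in the wanted list, then copy them.
selected_files <- all_files[codes %in% unique(df$ID_Codes)]
copied <- file.copy(selected_files, dest, overwrite = TRUE)
```

file.copy returns one logical per file, so all(copied) gives a quick check that every selected file reached the destination.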

glob2rx, placing a wildcard in the middle of expression and specifying exceptions, r

I am writing an R script that performs a function for all files in a series of subdirectories. I have run into a problem where several files in these subdirectories are matched by my glob2rx pattern, and I need help refining the pattern so I can select only the file I want.
Here is an example of my directory structure:
subdir1
file1_aaa_111_subdir1.txt
file1_bbb_111_subdir1.txt
file1_aaa_subdir1.txt
subdir2
file1_aaa_111_subdir2.txt
file1_bbb_111_subdir2.txt
file1_aaa_subdir2.txt
I want to select the last file in each directory, although in my actual directory its position varies. I want to use something like:
inFilePaths = list.files(path=".", pattern=glob2rx("*aaa*.txt"), full.names=TRUE)
but I don't get any files. Looking at this pattern, I would in theory get both the first and last file in each directory, meaning I need to write an exception to exclude the aaa_111 files and keep the aaa_subdir files.
There is a second option I have been thinking about, but I lack the ability to implement it. Notice that the name of the subdirectory is at the end of each file name. Is it possible to extract the directory name, combine it with a glob2rx pattern, and then directly specify which file I want? Like this:
#list all the subdirectories
subDirsPaths = list.dirs(path=".", full.names=TRUE)
#perform a function on these directories one by one
for (subDirsPath in subDirsPaths){
#make the subdirectory the working directory
setwd("/home/phil/Desktop/working")
setwd(paste(subDirsPath, sep=""))
# get the working directory name, and trim the "./" from it
directory <- gsub("./", "", paste(subDirsPath, sep=""))
# attempt to get the desired file by pasting the directory name into the glob2rx function
inFilePaths = list.files(path=".", pattern=glob2rx("*aaa_", print(directory,".txt")), full.names=TRUE)
for (inFilePath in inFilePaths)
{
inFileData <- read_tsv(inFilePath, col_names=TRUE)
}
}
With some modification the second option worked well. I ended up using paste in combination with print as follows:
inFilePaths = list.files(path=".", pattern=glob2rx(print(paste("*", "aaa_", directory, ".txt", sep=""))), full.names=TRUE)
The paste function combines the pieces into a single string, which preserves the wildcard. The print call only echoes that string to the console and returns it unchanged, so the pattern reaches glob2rx either way and print can be dropped.
While this doesn't let me place a wildcard in the middle of an expression, which I believe is done using an escape character, and it doesn't address the need to put exceptions on the wildcard, it works for my purposes.
I hope this helps others in my position.
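For the exception that the question originally asked about, one possible approach is to match with the glob first and then filter out the unwanted names in a second step. This is only a sketch on the example file names; the variable names are made up for illustration.

```r
# File names from the example directory listing.
file_names <- c("file1_aaa_111_subdir1.txt",
                "file1_bbb_111_subdir1.txt",
                "file1_aaa_subdir1.txt")

# glob2rx() turns the glob into an anchored regular expression;
# a wildcard in the middle of the glob is allowed.
rx <- glob2rx("*aaa*.txt")

# Keep names that match the glob but do not contain "aaa_111".
has_aaa <- grepl(rx, file_names)
is_excluded <- grepl("aaa_111", file_names, fixed = TRUE)
wanted <- file_names[has_aaa & !is_excluded]
```

When the files sit in subdirectories, list.files only looks inside the directory it is given unless recursive = TRUE is set, which may be why the first attempt returned no files.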

Using index number of file in directory

I'm using the list.files function in R. I know how to tell it to access all files in a directory, such as:
list.files("directory", full.names=TRUE)
But I don't really know how to subset the directory. If I just want list.files to list the 2nd, 5th, and 6th files in the directory, is there a way to tell list.files to only list those files? I've been thinking about whether it's possible to use the files' indices within the directory but I can't figure out how to do it. It's okay if I can only do this with consecutive files (such as 1:3) but non-consecutive would be even better.
The context of the question is that this is for a problem for a class, so I'm not worried about the files in the directory changing or being deleted.
If you store the result of list.files in an object, say object, you can see that it is just an atomic character vector (nothing more, nothing less!). You can subset it using regular expressions (via functions such as grep or grepl), with the regular subsetting operator [, or (most importantly) by combining both techniques.
For your example:
object[c(2,5,6)]
or exclude with:
object[-c(2,5,6)]
or, if you want to find all names that start with the string shuttle:
object[grepl("^shuttle", object)]
or, if you want to find all .csv files (the dot is escaped so that it matches a literal "."):
object[grepl("\\.csv$", object)]
The possibilities are vast.
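A small self-contained illustration of combining the two techniques, using a made-up vector in place of the real list.files output:

```r
# Stand-in for the result of list.files(); the names are invented.
object <- c("a.csv", "b.txt", "shuttle_1.csv", "shuttle_2.txt", "c.csv", "d.csv")

# Subset by position: the 2nd, 5th, and 6th entries.
by_index <- object[c(2, 5, 6)]

# Combine two conditions: starts with "shuttle" and ends in ".csv".
shuttle_csv <- object[grepl("^shuttle", object) & grepl("\\.csv$", object)]
```

Keep in mind that list.files returns names in alphabetical order, so the positions refer to that ordering rather than to creation time or the order shown in a file browser.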

Reading a file into R with partly unknown filename

Is there a way to read a file into R when I do not know the complete file name? Something like:
read.csv("abc_*")
In this case I do not know the complete file name after abc_
If you have exactly one file matching your criteria, you can do it like this:
read.csv(dir(pattern='^abc_')[1])
If there is more than one file, this approach would just use the first hit. In a more elaborate version you could loop over all matches and append them to one data frame or something like that.
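A minimal sketch of that more elaborate version, assuming every matching file has the same columns. The folder, the file names, and their contents are invented for the example.

```r
# Write two small CSV files to a throwaway folder (assumed names and contents).
demo_dir <- file.path(tempdir(), "abc_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(x = 1:2), file.path(demo_dir, "abc_one.csv"), row.names = FALSE)
write.csv(data.frame(x = 3:4), file.path(demo_dir, "abc_two.csv"), row.names = FALSE)

# All files whose name starts with "abc_", with their full paths.
matches <- dir(demo_dir, pattern = "^abc_", full.names = TRUE)

# Read every match, then stack the pieces; rbind assumes identical columns.
pieces <- lapply(matches, read.csv)
combined <- do.call(rbind, pieces)
```

If the files can differ in their columns, rbind will stop with an error, so some checking of the pieces would be needed first.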
Note that the pattern is a regular expression and is thus a bit different from what you expected (and from what I wrongly assumed in my first attempt at answering the question). Details can be found using ?regex
If the file is in a particular directory, you have to modify the dir command accordingly:
read.csv(dir('path/to/your/file', full.names=T, pattern="^abc"))
In your case the path might be c:\\users\\user\\desktop, with the pattern as above. full.names=T makes dir() return the full path rather than only the file name. Try running dir(...) without the read.csv to see what is happening.
If you want to give your path as a complete string, it again gets a bit more complicated:
filepath <- 'path/to/your/file/abc_'
read.csv(dir(dirname(filepath), full.names=T, pattern=paste("^", basename(filepath), sep='')))
That process may fail if your file name contains any regular expression metacharacters. You would have to replace them with their corresponding escape sequences beforehand. But that is another topic.
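One way to sidestep the escaping problem, offered here only as a sketch, is to list the directory without a pattern and compare the names literally with startsWith(), which does no regular expression matching. The folder and file names below are made up for illustration.

```r
# Throwaway folder with names that contain regex metacharacters (assumed names).
demo_dir <- file.path(tempdir(), "literal_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("abc_(v1.0).csv", "abc_xv1y0z.csv")))

# Known beginning of the path; "(" and "." would be special in a regex.
filepath <- file.path(demo_dir, "abc_(v1.0)")

# List everything, then keep names that literally start with the known prefix.
candidates <- dir(dirname(filepath), full.names = TRUE)
hits <- candidates[startsWith(basename(candidates), basename(filepath))]
```

The trade-off is that the whole directory is listed first, which is usually fine but could be slow for very large folders.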

Using R to list all files with a specified extension

I'm very new to R and am working on updating an R script to iterate through a series of .dbf tables created using ArcGIS and produce a series of graphs.
I have a directory, C:\Scratch, that will contain all of my .dbf files. However, when ArcGIS creates these tables, it also includes a .dbf.xml file. I want to remove these .dbf.xml files from my file list and thus my iteration. I've tried searching and experimenting with regular expressions to no avail. This is the basic expression I'm using (Excluding all of the various experimentation):
files <- list.files(pattern = "dbf")
Can anyone give me some direction?
files <- list.files(pattern = "\\.dbf$")
$ at the end anchors the match to the end of the string. "dbf$" will work too, but adding \\. (. is a special character in regular expressions, so you need to escape it) ensures that you match only files with the extension .dbf (in case you have, e.g., .adbf files).
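The difference between the three patterns is easy to see on a small vector of made-up file names:

```r
# Invented names standing in for the contents of the directory.
file_names <- c("roads.dbf", "roads.dbf.xml", "roads.adbf", "dbf_notes.txt")

loose  <- grep("dbf", file_names, value = TRUE)      # "dbf" anywhere in the name
ending <- grep("dbf$", file_names, value = TRUE)     # name ends in "dbf"
strict <- grep("\\.dbf$", file_names, value = TRUE)  # name ends in ".dbf"
```

Only the last pattern leaves out both the .dbf.xml companion file and the .adbf file.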
Try this, which uses globs rather than regular expressions, so it will only pick out the file names that end in .dbf:
filenames <- Sys.glob("*.dbf")
Peg the pattern to find "\\.dbf" at the end of the string using the $ character:
list.files(pattern = "\\.dbf$")
Gives you the list of files with full path:
Sys.glob(file.path(file_dir, "*.dbf")) ## file_dir = directory containing the files
I am not very good at using sophisticated regular expressions, so I'd do this task in the following way:
files <- list.files()
dbf.files <- files[!grepl(".xml", files, fixed=TRUE)]
The first line just lists all files in the working directory. The second one drops everything containing ".xml" (grepl returns TRUE for such strings in the 'files' vector; negating it and subsetting keeps only the other entries). Unlike negative indices from grep, this also works when nothing matches, where files[-grep(...)] would return an empty vector.
The "fixed" argument to grepl is just my whim, as I usually want it to perform crude pattern matching without fancy Perl-style regular expressions, which may surprise me.
I'm aware that such a solution simply reflects gaps in my education, but for a novice it may be useful =) At least it's easy.
