List files pattern with "AND" condition in R - r

I want to have a list which includes a character string and which excludes temporary files which include the same character string.
For example:
Files <- list.files(pattern= "Coca_cola" & "^[^~]")
I found on previous topics on this site that I have to use the "grep" function but I don't know how to use it in this context.

You can select the files that start with "Coca_cola" which will also drop the temporary files that begin with "~".
Files <- list.files(pattern= "^Coca_cola")

Would this do the trick?
Files <- grepl("Coca_cola", list.files)
Files <- !grepl("~", Files)

Related

R: Deleting Files in R based on their Names

I am working with the R programming language.
I found the following related question Stackoverflow (how to delete a file with R?) which shows you how to delete a file having a specific name from the working directory:
#Define the file name that will be deleted
fn <- "foo.txt"
#Check its existence
if (file.exists(fn)) {
#Delete file if it exists
file.remove(fn)
}
[1] TRUE
My Question: Is it possible to delete files based on whether the file name contains a specific combination of letters (i.e. LIKE 'fo%' )? This way, all files in the working directory starting with the letters "fo" will be deleted.
What I tried so far:
I thought of a way where I could first create a list of all files in the working directory that I want to delete based on their names:
# create list of all files in working directory
a = getwd()
path.to.csv <- a
files<-list.files(path.to.csv)
my_list = print(files) ## list all files in path
#identify files that match the condition
to_be_deleted = my_list[grepl("fo",unlist(my_list))]
Then, I tried to deleted this file using the command used earlier:
if (file.exists(to_be_deleted)) {
#Delete file if it exists
file.remove(to_be_deleted)
}
This returned the following message:
[1] TRUE TRUE TRUE TRUE TRUE TRUE
Warning message:
In if (file.exists(to_be_deleted)) { :
the condition has length > 1 and only the first element will be used
Does anyone know if I have done this correctly? Suppose if there were multiple files in the working directory where the names of these files started with "fo" - would all of these files have been deleted? Or only the first file in this list?
Can someone please show me how to do this correctly?
Thanks!
file.remove accepts a list of file to delete.
Regarding file.exists, it also accepts a list, but it will return a list of logical values, one for each file. And this won't work with if, which requires only one logical value.
However, you don't need to check the existence of files that you get from list.files: they obviously exist.
So, the simplest is to remove the if test and just call file.remove:
files <- list.files(path, pattern = "fo")
to_be_deleted <- grep("fo", files, value = T)
file.remove(to_be_deleted)
Or even simpler:
to_be_deleted <- list.files(path, pattern = "fo")
file.remove(to_be_deleted)
A few notes however:
Here you don't know in advance if you have the right to delete these
files.
You don't know either if the names are indeed files, or
directory (or something else). It's tempting to believe that
file.exists answer the second question, that is, it might tell you
that a name is a real file, but actually it does not:
file.exists(path) returns TRUE also when path is a directory.
However you can detect directories with dir.exists(path). Depending
on your specific case, it may or may not be necessary to check for
this (for instance, if you know the pattern passed to grep always
filters files, it's ok).

Delete files conditionally

I have a series of csv files in a folder and I want to delete them conditionally. The filename formats are like so :-
testfile_2020_05_01.csv
testfile_2020_05_02.csv
testfile_2020_05_03.csv
testfile_2020_05_04.csv
testfile_2020_05_05.csv
testfile_2020_05_06.csv
testfile_2020_05_07.csv
testfile_2020_05_08.csv
testfile_2020_05_09.csv
testfile_2020_05_10.csv
testfile_2020_05_11.csv
There are other files in the folder too. I want to delete the above files by specifying the following conditions :-
All files with filename beginning with "testfile" and ending with ".csv" with the length of the filename equal to 23.
Is there a way to do it in R?
I couldn't do it - I tried a lot of things like identifying ending in ".csv" or beginning with "testfile" but don't know how to combine multiple conditions with "list.files" command select these files.
Once I can select these files conditonally, I can then delete them using a loop unless there is a better and faster way of doing so.
Any suggestions/pointers would be greatly appreciated.
Best regards
Deepak
You can use :
#Get full path name for all the file
all_files <- list.files('/path/to/files', full.names = TRUE)
#Select conditionally files which start with testfile and end with csv
#and have 23 characters in them
files_to_delete <- all_files[grepl('^testfile.*\\.csv$', basename(all_files)) &
nchar(basename(all_files)) == 23]
#delete the files.
do.call(file.remove, list(files_to_delete))

glob2rx, placing a wildcard in the middle of expression and specificying exeptions, r

I have am writing an R script that performs a function for all files in a series of subdirectories. I have ran into a problem where several files in these subdirectories are being recognized by my glob2rx function, and I need help refining my pattern so I can select the file I want.
Here is an example of my directory structure:
subdir1
file1_aaa_111_subdir1.txt
file1_bbb_111_subdir1.txt
file1_aaa_subdir1.txt
subdir2
file1_aaa_111_subdir2.txt
file1_bbb_111_subdir2.txt
file1_aaa_subdir2.txt
I want to select for the last file in each directory, although in my actual directory its position is varied. I want to use something like:
inFilePaths = list.files(path=".", pattern=glob2rx("*aaa*.txt"), full.names=TRUE)
but I dont get any files. In looking at this pattern, I would in theory get both the first and last file in each directory. Meaning I need to write an exception to exclude the aaa_111 files, and keep the aaa_subdir files.
There is a second option I have been thinking about, but lack the ability to realize. Notice the name of the subdirectory is at the end of each file name. Is it possible to extract the directory name, and then combine it with a glob2rx pattern, and then directly specify which file I want? Like this:
#list all the subdirectories
subDirsPaths = list.dirs(path=".", full.names=TRUE)
#perform a function on these directories one by one
for (subDirsPath in subDirsPaths){
#make the subdirectory the working directory
setwd("/home/phil/Desktop/working")
setwd(paste(subDirsPath, sep=""))
# get the working directory name, and trim the "./" from it
directory <- gsub("./", "", paste(subDirsPath, sep=""))
# attempt to the get the desired file by pasting the directory name into the glob2rx funtion
inFilePaths = list.files(path=".", pattern=glob2rx("*aaa_", print(directory,".txt")), full.names=TRUE)
for (inFilePath in inFilePaths)
{
inFileData <- read_tsv(inFilePath, col_names=TRUE)
}
}
With some modification the second option worked well. I ended up using paste in combination with print as follows:
inFilePaths = list.files(path=".", pattern=glob2rx(print(paste("*", "aaa_", directory, ".txt", sep=""))), full.names=TRUE)
The paste function combined the text into a single string, which also preserved the wildcard. The print function added this to the list.files function as the glob2rx pattern.
While this doesn't allow me to place a wild card in the middle of an expression, which I believe is done use an escape character, and it doesn't address the need to place exceptions on the wild card, it works for my purposes.
I hope this helps others in my position.

Sys.glob () within unzip ()

TLDNR: How do I use Sys.glob () within unzip ()?
I have multiple .zip files and I want to extract only one file from each archive.
For example, one of the archives contains the following files:
[1] "cmc-20150531.xml" "cmc-20150531.xsd" "cmc-20150531_cal.xml" "cmc-20150531_def.xml" "cmc-20150531_lab.xml"
[6] "cmc-20150531_pre.xml"
I want to extract the first file because it matches a pattern. In order to do that I use the following command:
unzip("zip-archive.zip", files=Sys.glob("[a-z][a-z][a-z][-][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][.][x][m][l]"))
However, the command doesn't work, and I don't know why. R just extracts all files in the archive.
On the other hand, the following command works:
unzip("zip-archive.zip", files="cmc-20150531.xml")
How do I use Sys.glob() within unzip()?
Sys.glob expands files that already exist. So the parameter to your unzip call will depend on what files are in your working directory.
Perhaps you want to do unzip with list=TRUE to return the list of files in the zip first, and then use some pattern matching to select the files you want.
See ?grep for info on matching strings with patterns. These patterns are "regular expressions" rather than "glob" expansions, but you should be able to work with that.
Here's a concrete example:
# whats in the zip?
files = unzip("c.zip", list=TRUE)$Name
files
[1] "l_spatial.dbf" "l_spatial.shp" "l_spatial.shx" "ls_polys_bin.dbf"
[5] "ls_polys_bin.shp" "ls_polys_bin.shx" "rast_jan90.tif"
# what files have "dbf" in them:
files[grepl("dbf",files)]
[1] "l_spatial.dbf" "ls_polys_bin.dbf"
# extract just those:
unzip("c.zip", files=files[grepl("dbf",files)])
The regular expression for your glob
"[a-z][a-z][a-z][-][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][.][x][m][l]"
would be
"^[a-z]{3}-[0-9]{8}\\.xml$"
that's a match of start of string ("^"), 3 a-z (lower case only), a dash, eight digits, a dot (backslashes are needed, one because dot means "any one char" in regexps and another because R needs a backslash to escape a backslash), "xml", and the end of the string ("$").
Just with any other collections do an itertive loop through the results from Sys.glob and supply the itertive holding variable to unzip. This is achieved by using a for-loop
While unzip() takes an argument for the path, and files is an arugment for what files within that zip file.
Mined you I'm more a full stack programmer not so much so on the R lang, but the concepts are the same; so the code should something like:
files <- Sys.glob(path_expand(".","*.zip"))
for (idx in 1:length(files)) {
results = unzip(files[idx], "*.xml")
}
As for using regex in unzip() that is something one should read the documentation. I could only advise doing another for-loop to compare the contest of the zip file to your regex then preforming the extraction. Psudocode follows:
files ::= glob(*.zip)
regex ::=
for idx1 in length(files); do
regex="[a-z]{3}\-[0-9]{8}\.xml"
content = unzip(files[idx1])
for idx2 in length(content); do
if content[idx2].name ~= regex.expand(); then
# do something with found file
end if
end for
end for
Basically your just looping through your list of zip files, then through the list of files within the zip file and comparing the filename from inside your zipfile agenst the regex and extracting/preforming operations on only that file.

Using R to list all files with a specified extension

I'm very new to R and am working on updating an R script to iterate through a series of .dbf tables created using ArcGIS and produce a series of graphs.
I have a directory, C:\Scratch, that will contain all of my .dbf files. However, when ArcGIS creates these tables, it also includes a .dbf.xml file. I want to remove these .dbf.xml files from my file list and thus my iteration. I've tried searching and experimenting with regular expressions to no avail. This is the basic expression I'm using (Excluding all of the various experimentation):
files <- list.files(pattern = "dbf")
Can anyone give me some direction?
files <- list.files(pattern = "\\.dbf$")
$ at the end means that this is end of string. "dbf$" will work too, but adding \\. (. is special character in regular expressions so you need to escape it) ensure that you match only files with extension .dbf (in case you have e.g. .adbf files).
Try this which uses globs rather than regular expressions so it will only pick out the file names that end in .dbf
filenames <- Sys.glob("*.dbf")
Peg the pattern to find "\\.dbf" at the end of the string using the $ character:
list.files(pattern = "\\.dbf$")
Gives you the list of files with full path:
Sys.glob(file.path(file_dir, "*.dbf")) ## file_dir = file containing directory
I am not very good in using sophisticated regular expressions, so I'd do such task in the following way:
files <- list.files()
dbf.files <- files[-grep(".xml", files, fixed=T)]
First line just lists all files from working dir. Second one drops everything containing ".xml" (grep returns indices of such strings in 'files' vector; subsetting with negative indices removes corresponding entries from vector).
"fixed" argument for grep function is just my whim, as I usually want it to peform crude pattern matching without Perl-style fancy regexprs, which may cause surprise for me.
I'm aware that such solution simply reflects drawbacks in my education, but for a novice it may be useful =) at least it's easy.

Resources