Using R to list all files with a specified extension - r

I'm very new to R and am working on updating an R script to iterate through a series of .dbf tables created using ArcGIS and produce a series of graphs.
I have a directory, C:\Scratch, that will contain all of my .dbf files. However, when ArcGIS creates these tables, it also includes a .dbf.xml file. I want to remove these .dbf.xml files from my file list and thus my iteration. I've tried searching and experimenting with regular expressions to no avail. This is the basic expression I'm using (Excluding all of the various experimentation):
files <- list.files(pattern = "dbf")
Can anyone give me some direction?

files <- list.files(pattern = "\\.dbf$")
The $ at the end anchors the match to the end of the string. "dbf$" will work too, but adding \\. ensures that you match only files with the extension .dbf and not, for example, .adbf files. (The . is a special character in regular expressions, so you need to escape it.)
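A quick way to see the difference between these patterns is to try them on a few names with grepl(), which takes the same kind of regular expression as the pattern argument of list.files(). This is only a sketch; the file names are made up for illustration.

```r
# Hypothetical file names, similar to what might end up in the folder
files <- c("roads.dbf", "roads.dbf.xml", "roads.adbf", "dbf_notes.txt")

# Unanchored: matches "dbf" anywhere in the name
loose <- files[grepl("dbf", files)]

# Anchored at the end only: still lets "roads.adbf" through
end_only <- files[grepl("dbf$", files)]

# Literal dot plus end anchor: only names that really end in .dbf
strict <- files[grepl("\\.dbf$", files)]

print(loose)
print(end_only)
print(strict)
```

Only the last pattern leaves just roads.dbf; the first also keeps the .dbf.xml file, which is the problem in the question.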

Try this, which uses globs rather than regular expressions, so it will only pick out the file names that end in .dbf:
filenames <- Sys.glob("*.dbf")

Peg the pattern to find "\\.dbf" at the end of the string using the $ character:
list.files(pattern = "\\.dbf$")

This gives you the list of files with their full paths:
Sys.glob(file.path(file_dir, "*.dbf")) ## file_dir = directory containing the files
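To check the glob approach without touching real data, something like the following could be run. It is a sketch only: file_dir here points at a scratch folder under tempdir(), and the file names are invented.

```r
# Scratch directory with made-up, empty files (illustrative names only)
file_dir <- file.path(tempdir(), "glob_demo")
dir.create(file_dir, showWarnings = FALSE)
invisible(file.create(file.path(file_dir, c("a.dbf", "a.dbf.xml", "b.dbf"))))

# With a glob, '*' matches any run of characters and the whole name must match,
# so "a.dbf.xml" is not picked up
full_paths <- Sys.glob(file.path(file_dir, "*.dbf"))

# The regex route for comparison; full.names = TRUE also returns paths
regex_paths <- list.files(file_dir, pattern = "\\.dbf$", full.names = TRUE)

print(basename(full_paths))
print(basename(regex_paths))
```

Both calls should pick out the same two .dbf files here.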

I am not very good at using sophisticated regular expressions, so I'd do this task in the following way:
files <- list.files()
dbf.files <- files[!grepl(".xml", files, fixed = TRUE)]
The first line just lists all files in the working directory. The second drops everything containing ".xml" (grepl returns a logical vector marking such strings in the 'files' vector; subsetting with its negation removes those entries). A logical index is used rather than negative indices from grep because files[-grep(...)] returns an empty vector when nothing matches.
The "fixed" argument to grepl is just my whim, as I usually want it to perform crude pattern matching without fancy Perl-style regular expressions, which may surprise me.
I'm aware that this solution simply reflects gaps in my education, but for a novice it may be useful =) at least it's easy.

Related

Using R to read all files in a specific format and with specific extension

I want to read all the files in the xlsx format starting with a string named "csmom". I have used list.files function. But I do not know how to set double pattern. Please see the code. I want to read all the files starting csmom string and they all should be in .xlsx format.
master1<-list.files(path = "C:/Users/Admin/Documents/csmomentum funal",pattern="^csmom")
master2<-list.files(path = "C:/Users/Admin/Documents/csmomentum funal",pattern="^\\.xlsx$")
@jay.sf's solution works for creating a single regular expression that expresses the condition you want.
However, generally speaking, if you want to cross two lists to find the subset of elements contained in both (in your case, the files that satisfy both conditions), you can use intersect(). Note that the second pattern needs to be "\\.xlsx$" without the leading ^; as written, "^\\.xlsx$" only matches a file named exactly ".xlsx".
intersect(master1, master2)
This will show you all the files that satisfy both pattern 1 and pattern 2.
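For what it's worth, here is a small sketch comparing the intersect() route with a single combined pattern. It assumes the second pattern is written as "\\.xlsx$", and it uses a scratch folder under tempdir() with made-up file names instead of the real directory.

```r
# Scratch directory with made-up, empty files (illustrative names only)
demo_dir <- file.path(tempdir(), "csmom_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(
  demo_dir,
  c("csmom_1.xlsx", "csmom_2.xlsx", "csmom_notes.txt", "other.xlsx")
)))

# Two separate conditions, then keep the names that satisfy both
master1 <- list.files(demo_dir, pattern = "^csmom")
master2 <- list.files(demo_dir, pattern = "\\.xlsx$")
both <- intersect(master1, master2)

# One pattern: starts with csmom, anything in between, ends in .xlsx
single <- list.files(demo_dir, pattern = "^csmom.*\\.xlsx$")

print(both)
print(single)
```

The two results should agree; the single pattern just saves a step.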

Reading multiple csv files from a folder with R using regex

I wish to use R to read multiple csv files from a single folder. If I wanted to read every csv file I could use:
list.files(folder, pattern="*.csv")
See, for example, these questions:
Reading multiple csv files from a folder into a single dataframe in R
Importing multiple .csv files into R
However, I only wish to read one of four subsets of the files at a time. Below is an example grouping of four files each for three models.
JS.N_Nov6_2017_model220_N200.csv
JS.N_Nov6_2017_model221_N200.csv
JS.N_Nov6_2017_model222_N200.csv
my.IDs.alt_Nov6_2017_model220_N200.csv
my.IDs.alt_Nov6_2017_model221_N200.csv
my.IDs.alt_Nov6_2017_model222_N200.csv
parms_Nov6_2017_model220_N200.csv
parms_Nov6_2017_model221_N200.csv
parms_Nov6_2017_model222_N200.csv
supN_Nov6_2017_model220_N200.csv
supN_Nov6_2017_model221_N200.csv
supN_Nov6_2017_model222_N200.csv
If I only wish to read, for example, the parms files I try the following, which does not work:
list.files(folder, pattern="parm*.csv")
I am assuming that I may need to use regex to read a given group of the four groups present, but I do not know.
How can I read each of the four groups separately?
EDIT
I am unsure whether I would have been able to obtain the solution from answers to this question:
Listing all files matching a full-path pattern in R
I may have had to spend a fair bit of time brushing up on regex to apply those answers to my problem. The answer provided below by Mako212 is outstanding.
A quick REGEX 101 explanation:
For the case of matching the beginning and end of the string, which is all you need to do here, the following principles apply to match files that are .csv and start with parm:
list.files(folder, pattern="^parm.*?\\.csv")
^ asserts we're at the beginning of the string, so ^parm means match parm, but only if it's at the beginning of the string.
.*? means match anything up until the next part of the pattern matches. In this case, match until we see a period \\.
. means match any character in REGEX, so we need to escape it with \\ to match the literal . (note that in R you need the double escape \\; in other languages a single escape \ is sufficient).
Finally, csv means match csv after the literal dot. If we were going to be really thorough, we might use \\.csv$, with the $ indicating the end of the string. You'd need the dollar sign if you had other files with an extension like .csv2: \\.csv would match .csv2, whereas \\.csv$ would not.
In your case, you could simply replace parm in the REGEX pattern with JS, my, or supN to select one of your other file types.
Finally, if you wanted to match a subset of your total file list, you could use the | logical "or" operator:
list.files(folder, pattern = "^(parm|JS|supN).*?\\.csv")
This would return all the file names except the ones that start with my.
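The patterns above can be tried out on a few of the file names from the question. In this sketch grep() on a character vector stands in for list.files(), so nothing has to exist on disk; the $ anchor mentioned above is included.

```r
# A few of the file names from the question
files <- c(
  "JS.N_Nov6_2017_model220_N200.csv",
  "my.IDs.alt_Nov6_2017_model220_N200.csv",
  "parms_Nov6_2017_model220_N200.csv",
  "parms_Nov6_2017_model221_N200.csv",
  "supN_Nov6_2017_model220_N200.csv"
)

# Names that start with parm and end in .csv
parms_only <- grep("^parm.*?\\.csv$", files, value = TRUE)

# Names that start with parm, JS or supN (so everything except the my.* files)
not_my <- grep("^(parm|JS|supN).*?\\.csv$", files, value = TRUE)

print(parms_only)
print(not_my)
```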
The list.files statement shown in the question uses globs, but list.files accepts regular expressions, not globs.
Sys.glob: To use globs, use Sys.glob like this:
olddir <- setwd(folder)
parm <- lapply(Sys.glob("parm*.csv"), read.csv)
parm is now a list of data frames read in from those files.
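glob2rx: Note that the glob2rx function can be used to convert globs to regular expressions (full.names = TRUE makes list.files return paths that read.csv can open from any working directory):
parm <- lapply(list.files(folder, pattern = glob2rx("parm*.csv"), full.names = TRUE), read.csv)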
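To see what glob2rx actually produces, one can print the converted pattern and test it on a few names. A small sketch follows; the names in names_demo are made up, apart from the first two, which follow the question's naming scheme.

```r
# Convert the glob from the question into a regular expression
rx <- utils::glob2rx("parm*.csv")
print(rx)

# Try the converted pattern on a few names (the last one is invented)
names_demo <- c(
  "parms_Nov6_2017_model220_N200.csv",
  "supN_Nov6_2017_model220_N200.csv",
  "parms.csv2"
)
matched <- names_demo[grepl(rx, names_demo)]
print(matched)
```

The converted pattern is anchored at both ends, so a name like parms.csv2 is not matched.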

Using index number of file in directory

I'm using the list.files function in R. I know how to tell it to access all files in a directory, such as:
list.files("directory", full.names=TRUE)
But I don't really know how to subset the directory. If I just want list.files to list the 2nd, 5th, and 6th files in the directory, is there a way to tell list.files to only list those files? I've been thinking about whether it's possible to use the files' indices within the directory but I can't figure out how to do it. It's okay if I can only do this with consecutive files (such as 1:3) but non-consecutive would be even better.
The context of the question is that this is for a problem for a class, so I'm not worried about the files in the directory changing or being deleted.
If you store the result of list.files in an object, say object, you can see that it is just an atomic character vector (nothing more, nothing less!). You can subset it with regular expressions (through functions such as grep or grepl), with the regular subsetting operator [, or (most importantly) by combining both techniques.
For your example:
object[c(2,5,6)]
or exclude with:
object[-c(2,5,6)]
or, if you want to find all names that start with the string shuttle:
object[grepl("^shuttle", object)]
or, if you want to find all .csv files (the dot is escaped so that it matches a literal dot):
object[grepl("\\.csv$", object)]
The possibilities are huge.
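Here is a small sketch of these subsetting ideas on a made-up vector. The object below is a stand-in for the result of list.files("directory"); the names in it are invented.

```r
# Stand-in for list.files("directory"); names are invented
object <- c("notes.txt", "shuttle_a.csv", "shuttle_b.csv",
            "summary.csv", "xcsv", "zeta.txt")

# Pick the 2nd, 5th and 6th entries, or everything except them
by_index <- object[c(2, 5, 6)]
dropped  <- object[-c(2, 5, 6)]

# Combine two conditions: starts with shuttle AND ends in .csv
shuttle_csv <- object[grepl("^shuttle", object) & grepl("\\.csv$", object)]

# An unescaped dot matches any character, so this also matches "xcsv"
unescaped <- object[grepl(".csv$", object)]

print(by_index)
print(shuttle_csv)
print(unescaped)
```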

Sys.glob () within unzip ()

TL;DR: How do I use Sys.glob() within unzip()?
I have multiple .zip files and I want to extract only one file from each archive.
For example, one of the archives contains the following files:
[1] "cmc-20150531.xml" "cmc-20150531.xsd" "cmc-20150531_cal.xml" "cmc-20150531_def.xml" "cmc-20150531_lab.xml"
[6] "cmc-20150531_pre.xml"
I want to extract the first file because it matches a pattern. In order to do that I use the following command:
unzip("zip-archive.zip", files=Sys.glob("[a-z][a-z][a-z][-][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][.][x][m][l]"))
However, the command doesn't work, and I don't know why. R just extracts all files in the archive.
On the other hand, the following command works:
unzip("zip-archive.zip", files="cmc-20150531.xml")
How do I use Sys.glob() within unzip()?
Sys.glob expands files that already exist. So the parameter to your unzip call will depend on what files are in your working directory.
Perhaps you want to do unzip with list=TRUE to return the list of files in the zip first, and then use some pattern matching to select the files you want.
See ?grep for info on matching strings with patterns. These patterns are "regular expressions" rather than "glob" expansions, but you should be able to work with that.
Here's a concrete example:
# what's in the zip?
files = unzip("c.zip", list=TRUE)$Name
files
[1] "l_spatial.dbf" "l_spatial.shp" "l_spatial.shx" "ls_polys_bin.dbf"
[5] "ls_polys_bin.shp" "ls_polys_bin.shx" "rast_jan90.tif"
# what files have "dbf" in them:
files[grepl("dbf",files)]
[1] "l_spatial.dbf" "ls_polys_bin.dbf"
# extract just those:
unzip("c.zip", files=files[grepl("dbf",files)])
The regular expression for your glob
"[a-z][a-z][a-z][-][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][.][x][m][l]"
would be
"^[a-z]{3}-[0-9]{8}\\.xml$"
That's a match of the start of the string ("^"), three characters from a-z (lower case only), a dash, eight digits, a dot (two backslashes are needed: one because a dot means "any one char" in regexps, and another because R needs a backslash to escape a backslash), "xml", and the end of the string ("$").
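Applied to the archive listing from the question, the filtering step might look like the sketch below. The members vector is typed in by hand here; in real use it would presumably come from unzip("zip-archive.zip", list = TRUE)$Name, and the result would be passed on as the files argument.

```r
# Names inside the archive, copied from the question
members <- c("cmc-20150531.xml", "cmc-20150531.xsd", "cmc-20150531_cal.xml",
             "cmc-20150531_def.xml", "cmc-20150531_lab.xml", "cmc-20150531_pre.xml")

# Keep only names of the form: three lower-case letters, dash, eight digits, .xml
wanted <- members[grepl("^[a-z]{3}-[0-9]{8}\\.xml$", members)]
print(wanted)

# In real use (not run here, since it needs the archive on disk):
# unzip("zip-archive.zip", files = wanted)
```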
As with any other collection, loop over the results from Sys.glob and pass the loop variable to unzip. This is done with a for-loop.
unzip() takes the path of the zip file as its first argument, and files is the argument naming which files within that zip file to extract.
Mind you, I'm more of a full-stack programmer and not so much an R person, but the concepts are the same, so the code should look something like:
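zips <- Sys.glob(file.path(".", "*.zip"))
for (zip in zips) {
  # unzip() does not expand wildcards: list the archive, then filter the names
  members <- unzip(zip, list = TRUE)$Name
  wanted <- members[grepl("\\.xml$", members)]
  # an empty 'files' vector extracts everything (as seen in the question), so skip it
  if (length(wanted) > 0) unzip(zip, files = wanted)
}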
As for using a regex in unzip(), one should read the documentation. I could only advise a second for-loop that compares the contents of the zip file to your regex and then performs the extraction. A sketch follows:
zips <- Sys.glob("*.zip")
regex <- "^[a-z]{3}-[0-9]{8}\\.xml$"
for (zip in zips) {
  members <- unzip(zip, list = TRUE)$Name
  for (member in members) {
    if (grepl(regex, member)) {
      # do something with the found file, e.g. extract it
      unzip(zip, files = member)
    }
  }
}
Basically you're just looping through your list of zip files, then through the list of files within each zip file, comparing each file name inside the archive against the regex, and extracting or performing operations on only the matching files.

using R to copy files

As part of a larger task performed in R run under windows, I would like to copy selected files between directories. Is it possible to give within R a command like cp patha/filea*.csv pathb (notice the wildcard, for extra spice)?
I don't think there is a direct way (short of shelling out), but something like the following usually works for me.
flist <- list.files("patha", "^filea.+[.]csv$", full.names = TRUE)
file.copy(flist, "pathb")
Notes:
I purposely split this into two steps; they can be combined.
Look at the regular expression: R uses true regex, and it also separates the file pattern from the path, in two separate arguments.
Note the ^ and $ (beginning and end of string) in the regex. This is a common gotcha, as these are implicit in wildcard-type patterns but required with regexes (otherwise file names that match the wildcard pattern but also start or end with additional text would be selected as well).
In the Windows world, people will typically add the ignore.case = TRUE argument to list.files, in order to emulate the fact that directory searches are case insensitive with this OS.
R's glob2rx() function provides a convenient way to convert wildcard patterns to regular expressions. For example fpattern = glob2rx('filea*.csv') returns a different but equivalent regex.
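Putting the notes above together, a rough equivalent of cp patha/filea*.csv pathb could look like this. It is only a sketch: patha and pathb are scratch folders under tempdir() and the file names are made up, so nothing real gets copied.

```r
# Scratch source and destination directories (illustrative only)
patha <- file.path(tempdir(), "patha")
pathb <- file.path(tempdir(), "pathb")
dir.create(patha, showWarnings = FALSE)
dir.create(pathb, showWarnings = FALSE)
invisible(file.create(file.path(
  patha,
  c("filea1.csv", "filea2.csv", "fileb1.csv", "filea1.csv.bak")
)))

# Convert the wildcard into an anchored regex, list with full paths, then copy
fpattern <- utils::glob2rx("filea*.csv")
flist <- list.files(patha, pattern = fpattern, full.names = TRUE)
copied <- file.copy(flist, pathb, overwrite = TRUE)

print(list.files(pathb))
```

full.names = TRUE matters here: without it, file.copy would look for the files in the current working directory rather than in patha.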
You can:
use system() to fire off a command as if it were typed in a shell, including globbing
use list.files() (aka dir()) to do the globbing / regexp matching yourself and then copy the files individually
use file.copy on individual files as shown in mjv's answer

Resources