How can I search for a particular string in a folder full of files and import it? - r

I have a folder full of files with filenames such as these.
[1] "ts.01382800.crest.csv" "ts.01383500.crest.csv" "ts.01384500.crest.csv" "ts.01386000.crest.csv"
[5] "ts.01387000.crest.csv" "ts.01387400.crest.csv" "ts.01387420.crest.csv" "ts.01387450.crest.csv"
[9] "ts.01387500.crest.csv" "ts.01387908.crest.csv"
I need to do one operation again and again: search for a particular string (say, 1382800), find the filename that matches, and import that file. Note that the match may not be exact, as the leading zero is sometimes missing.
Currently, I am importing a list of the files in the folder using list.files, using grep on that list to find a filename, then re-constructing the file path and importing it. There must be an easier way to do it.

This code filters the file names that contain 1382800:
ls <- list.files(path='~/',pattern="1382800")
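One way to collapse the search-and-import into a single step is to let list.files() return full paths and feed the first match straight to read.csv(). This is a sketch, not code from the question: the helper name and the optional-leading-zero handling are assumptions.

```r
# read the crest file for a given gauge ID, tolerating a missing leading zero
read_gauge <- function(id, path = "~/") {
  # full.names = TRUE returns complete paths, so no path reconstruction is needed
  hits <- list.files(path, pattern = paste0("0?", id, "\\.crest\\.csv$"),
                     full.names = TRUE)
  if (length(hits) == 0) stop("no file found for ID ", id)
  read.csv(hits[1])
}
```

With the files above, read_gauge("1382800") would pick up ts.01382800.crest.csv even though the search string lacks the leading zero.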

Related

R: Deleting Files in R based on their Names

I am working with the R programming language.
I found the following related question on Stack Overflow (how to delete a file with R?) which shows you how to delete a file having a specific name from the working directory:
#Define the file name that will be deleted
fn <- "foo.txt"
#Check its existence
if (file.exists(fn)) {
#Delete file if it exists
file.remove(fn)
}
[1] TRUE
My Question: Is it possible to delete files based on whether the file name contains a specific combination of letters (i.e. LIKE 'fo%' )? This way, all files in the working directory starting with the letters "fo" will be deleted.
What I tried so far:
I thought of a way where I could first create a list of all files in the working directory that I want to delete based on their names:
# create list of all files in working directory
a = getwd()
path.to.csv <- a
files<-list.files(path.to.csv)
my_list = print(files) ## list all files in path
#identify files that match the condition
to_be_deleted = my_list[grepl("fo",unlist(my_list))]
Then, I tried to delete these files using the command from earlier:
if (file.exists(to_be_deleted)) {
#Delete file if it exists
file.remove(to_be_deleted)
}
This returned the following message:
[1] TRUE TRUE TRUE TRUE TRUE TRUE
Warning message:
In if (file.exists(to_be_deleted)) { :
the condition has length > 1 and only the first element will be used
Does anyone know if I have done this correctly? Suppose if there were multiple files in the working directory where the names of these files started with "fo" - would all of these files have been deleted? Or only the first file in this list?
Can someone please show me how to do this correctly?
Thanks!
file.remove accepts a vector of files to delete.
As for file.exists, it also accepts a vector, but it returns one logical value per file, and that won't work with if, which requires a single logical value.
However, you don't need to check the existence of files returned by list.files: they obviously exist.
So the simplest fix is to remove the if test and just call file.remove:
files <- list.files(path)
to_be_deleted <- grep("fo", files, value = TRUE)
file.remove(to_be_deleted)
Or even simpler:
to_be_deleted <- list.files(path, pattern = "fo")
file.remove(to_be_deleted)
A few notes, however:

- You don't know in advance whether you have the right to delete these files.
- You don't know either whether the names are really files or directories (or something else). It's tempting to believe that file.exists answers the second question, that is, that it tells you whether a name is a real file, but actually it does not: file.exists(path) also returns TRUE when path is a directory. However, you can detect directories with dir.exists(path). Depending on your specific case, it may or may not be necessary to check for this (for instance, if you know the pattern passed to grep always matches files only, it's fine).
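Putting those notes together, a defensive version could filter directories out before deleting. The helper below is a sketch; its name is mine, and it assumes you really do want every matching name in the directory removed:

```r
# delete plain files (not directories) whose names match a pattern
delete_matching_files <- function(path, pattern) {
  candidates <- list.files(path, pattern = pattern, full.names = TRUE)
  to_delete <- candidates[!dir.exists(candidates)]  # keep regular files only
  file.remove(to_delete)
}
```

For example, delete_matching_files(getwd(), "fo") would remove foo.txt but leave a subdirectory named fodir untouched.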

How can I read in excel files by looking for a string pattern?

I need to read in a bunch of files that are scattered across different directories.
The problem is, these files all have slightly different naming variations such as:
7-2018 RECON.xlsx
RECON 06-2019.xlsx
5-31-2017 RECON LINKED.xlsx
I want to read in excel files to look for the keyword "RECON" in the file name.
I tried using the contains function in the read_excel function - didn't work.
Any ideas?
Thanks!
You could identify the list of relevant files and get their paths with something like:
> normalizePath(list.files(pattern="Rmd", ignore.case=TRUE, recursive=TRUE))
[1] "/Users/david/Dropbox (DaveArmstrong)/9590/Lecture1/Lecture1.Rmd"
[2] "/Users/david/Dropbox (DaveArmstrong)/9590/Lecture2/lec2_inclass.Rmd"
[3] "/Users/david/Dropbox (DaveArmstrong)/9590/Lecture2/lecture2.rmd"
[4] "/Users/david/Dropbox (DaveArmstrong)/9590/Lecture3/lecture3.rmd"
You would probably want a pattern like ".*RECON.*\\.xlsx$" which would find <anything>RECON<anything>.xlsx<end of string>. You could save the result as a vector of file names and then loop over them to read them in.
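As a sketch, that search can be wrapped in a helper returning full paths ready for a reader such as readxl::read_excel() (readxl is an assumption here; the question doesn't say which package is in use, and the function name is mine):

```r
# find every Excel file whose name contains RECON, anywhere below root
find_recon_files <- function(root = ".") {
  list.files(root, pattern = "RECON.*\\.xlsx$", ignore.case = TRUE,
             recursive = TRUE, full.names = TRUE)
}
```

You could then read them all with lapply(find_recon_files(), readxl::read_excel).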

Sys.glob() within unzip()

TL;DR: How do I use Sys.glob() within unzip()?
I have multiple .zip files and I want to extract only one file from each archive.
For example, one of the archives contains the following files:
[1] "cmc-20150531.xml" "cmc-20150531.xsd" "cmc-20150531_cal.xml" "cmc-20150531_def.xml" "cmc-20150531_lab.xml"
[6] "cmc-20150531_pre.xml"
I want to extract the first file because it matches a pattern. In order to do that I use the following command:
unzip("zip-archive.zip", files=Sys.glob("[a-z][a-z][a-z][-][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][.][x][m][l]"))
However, the command doesn't work, and I don't know why. R just extracts all files in the archive.
On the other hand, the following command works:
unzip("zip-archive.zip", files="cmc-20150531.xml")
How do I use Sys.glob() within unzip()?
Sys.glob expands the pattern against files that already exist on disk, so the parameter to your unzip call will depend on what files happen to be in your working directory, not on the contents of the archive.
Perhaps you want to do unzip with list=TRUE to return the list of files in the zip first, and then use some pattern matching to select the files you want.
See ?grep for info on matching strings with patterns. These patterns are "regular expressions" rather than "glob" expansions, but you should be able to work with that.
Here's a concrete example:
# whats in the zip?
files = unzip("c.zip", list=TRUE)$Name
files
[1] "l_spatial.dbf" "l_spatial.shp" "l_spatial.shx" "ls_polys_bin.dbf"
[5] "ls_polys_bin.shp" "ls_polys_bin.shx" "rast_jan90.tif"
# what files have "dbf" in them:
files[grepl("dbf",files)]
[1] "l_spatial.dbf" "ls_polys_bin.dbf"
# extract just those:
unzip("c.zip", files=files[grepl("dbf",files)])
The regular expression for your glob
"[a-z][a-z][a-z][-][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][.][x][m][l]"
would be
"^[a-z]{3}-[0-9]{8}\\.xml$"
that's a match of start of string ("^"), 3 a-z (lower case only), a dash, eight digits, a dot (backslashes are needed, one because dot means "any one char" in regexps and another because R needs a backslash to escape a backslash), "xml", and the end of the string ("$").
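That translation can be sanity-checked against the file names from the question (the names below are copied from the archive listing above):

```r
# regex equivalent of the glob: 3 lowercase letters, dash, 8 digits, ".xml"
pattern <- "^[a-z]{3}-[0-9]{8}\\.xml$"
grepl(pattern, "cmc-20150531.xml")      # TRUE  - the bare .xml file matches
grepl(pattern, "cmc-20150531_cal.xml")  # FALSE - the suffixed variants do not
```

Feeding grep(pattern, ..., value = TRUE) the output of unzip(..., list = TRUE)$Name, as in the example above, therefore selects exactly the one file wanted.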
Just as with any other collection, do an iterative loop through the results from Sys.glob and supply the iteration variable to unzip. This is achieved with a for loop.
unzip() takes the path to the archive as its first argument, and files is an argument naming which files within that zip file to extract.
Mind you, I'm more of a full-stack programmer and not so strong in R, but the concepts are the same, so the code should look something like:
files <- Sys.glob(file.path(".", "*.zip"))
for (idx in seq_along(files)) {
  # note: the files argument of unzip() is not globbed, so "*.xml" would not match
  results <- unzip(files[idx])
}
As for using a regex in unzip(), the documentation is the place to check. I can only advise doing another for loop that compares the contents of each zip file against your regex and then performs the extraction. In R that looks something like:
files <- Sys.glob("*.zip")
regex <- "[a-z]{3}-[0-9]{8}\\.xml"
for (zipfile in files) {
  # list the archive contents without extracting anything
  contents <- unzip(zipfile, list = TRUE)$Name
  matches <- contents[grepl(regex, contents)]
  if (length(matches) > 0) {
    # do something with the found files
    unzip(zipfile, files = matches)
  }
}
Basically you're just looping through your list of zip files, then through the list of files within each zip, comparing each filename inside the zip file against the regex, and extracting or performing operations on only the matching files.

List files with specific word and file extension

I have a bunch of files in a directory that contain various extensions, but the ones I'm most interested in are *.bil. For each year there are 12 files.
What I'm stuck on is matching a year with *.bil, so my list will have 12 files for year 2000.
Example filenames:
**** Edit (added actual filenames):
PRISM_tmin_stable_4kmM2_200001_bil.bil
PRISM_tmin_stable_4kmM2_200002_bil.bil
Code:
Filenames <- list.files("/../directory", pattern = "//.bil")
This will select all *.bil files, but there are hundreds, so I need to specify only year 2000.
Any ideas?
The list.files command interprets pattern as a regular expression rather than a shell wildcard, so you should be able to do something like:
list.files("/../directory", pattern = ".*_2000.*\\.bil$")
Let me know if that works.
This should also work, to iterate through the PRISM folders, and only pull out the .bil pattern (you need to keep the other files in the same folder so that it understands the raster data the .bil file comes with). recursive=T allows you to pull from multiple folders in your path (or directory), and by setting pattern you will only pull out files with JUST the .bil extension (not asc.bil, etc).
filenames <- list.files(path = "PRISM",
                        recursive = TRUE,
                        pattern = "\\.bil$",
                        full.names = TRUE)
You can add the above code in with the details above specifying the year 2000.
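Combining both answers, the year can be pinned down by building it into the pattern. The PRISM naming scheme is taken from the question's example filenames; the helper name is mine:

```r
# list only the monthly .bil files for a given year, e.g. _200001_ ... _200012_
bil_for_year <- function(path, year) {
  list.files(path, pattern = paste0("_", year, "[0-9]{2}_bil\\.bil$"),
             recursive = TRUE, full.names = TRUE)
}
```

bil_for_year("PRISM", 2000) would then return the 12 files for year 2000 and skip every other year.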

Reading a file into R with partly unknown filename

Is there a way to read a file into R when I do not know the complete file name? Something like:
read.csv("abc_*")
In this case I do not know the complete file name after abc_
If you have exactly one file matching your criteria, you can do it like this:
read.csv(dir(pattern='^abc_')[1])
If there is more than one file, this approach would just use the first hit. In a more elaborated version you could loop over all matches and append them to one dataframe or something like that.
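A sketch of that more elaborated version, stacking every match into one data frame (the helper name is an assumption, and it presumes all matching files share the same columns):

```r
# read every file starting with abc_ and bind the rows together
read_abc <- function(path = ".") {
  matches <- dir(path, pattern = "^abc_", full.names = TRUE)
  do.call(rbind, lapply(matches, read.csv))
}
```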
Note that the pattern uses regular expressions and thus is a bit different from what you might expect (and from what I wrongly assumed in my first attempt to answer the question). Details can be found via ?regex.
If you have a directory you want to supply, you have to modify the dir command accordingly:
read.csv(dir('path/to/your/file', full.names=T, pattern="^abc"))
The submitted path in your case may be c:\\users\\user\\desktop, and then the pattern as above. full.names=T forces dir() to output a whole path and not only the file name. Try running dir(...) without the read.csv to understand what is happening there.
If you want to give your path as a complete string, it again gets a bit more complicated:
filepath <- 'path/to/your/file/abc_'
read.csv(dir(dirname(filepath), full.names=T, pattern=paste("^", basename(filepath), sep='')))
That process will fail if your filename contains any regular-expression metacharacters. You would have to substitute them with their corresponding escape sequences upfront. But that again is another topic.