Recognize arbitrary file extensions in R?

I'm writing a function in R that will take the path name of a folder as its argument and return a vector containing the names of all the files in that folder which have the extension ".pvalues".
myFunction <- function(path) {
  # return vector that contains the names of all files
  # in this folder that end in extension ".pvalues"
}
I know how to get the names of the files in the folder, like so:
> list.files("/Users/me/myfolder/")
[1] "myfile.txt"
[2] "myfile.txt.a"
[3] "myfile.txt.b"
[4] "myfile.txt.a.pvalues"
[5] "myfile.txt.b.pvalues"
Is there an easy way to identify all the files in this folder that end in ".pvalues"? I cannot assume that the names will start with "myfile". They could start with "yourfile", for instance.

Take a look at ?list.files; you want the pattern argument (note that it takes a regular expression, not a glob, so no leading *):
list.files(path = '/Users/me/myfolder', pattern = '\\.pvalues$')
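For completeness, a minimal sketch of the function the question asks for, built on that one-liner (the expected output below is taken from the directory listing in the question):
myFunction <- function(path) {
  # list.files() returns the names of all files in `path` whose names
  # match the regular expression: an escaped dot, "pvalues", end of string
  list.files(path, pattern = "\\.pvalues$")
}

myFunction("/Users/me/myfolder/")
# [1] "myfile.txt.a.pvalues" "myfile.txt.b.pvalues"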

Related

R: Deleting Files in R based on their Names

I am working with the R programming language.
I found the following related question on Stack Overflow (how to delete a file with R?), which shows how to delete a file with a specific name from the working directory:
#Define the file name that will be deleted
fn <- "foo.txt"
#Check its existence
if (file.exists(fn)) {
  #Delete file if it exists
  file.remove(fn)
}
[1] TRUE
My Question: Is it possible to delete files based on whether the file name contains a specific combination of letters (i.e. LIKE 'fo%' )? This way, all files in the working directory starting with the letters "fo" will be deleted.
What I tried so far:
I thought of a way where I could first create a list of all files in the working directory that I want to delete based on their names:
# create list of all files in working directory
a <- getwd()
path.to.csv <- a
files <- list.files(path.to.csv)
my_list <- print(files) ## list all files in path
# identify files that match the condition
to_be_deleted <- my_list[grepl("fo", unlist(my_list))]
Then, I tried to delete these files using the command from earlier:
if (file.exists(to_be_deleted)) {
  #Delete file if it exists
  file.remove(to_be_deleted)
}
This returned the following message:
[1] TRUE TRUE TRUE TRUE TRUE TRUE
Warning message:
In if (file.exists(to_be_deleted)) { :
the condition has length > 1 and only the first element will be used
Does anyone know if I have done this correctly? Suppose if there were multiple files in the working directory where the names of these files started with "fo" - would all of these files have been deleted? Or only the first file in this list?
Can someone please show me how to do this correctly?
Thanks!
file.remove accepts a vector of files to delete.
file.exists also accepts a vector, but it returns one logical value per file, and that does not work with if, which requires a single logical value.
However, you don't need to check the existence of files you get back from list.files: they obviously exist.
So the simplest fix is to drop the if test and just call file.remove:
files <- list.files(path, pattern = "fo")
to_be_deleted <- grep("fo", files, value = TRUE)
file.remove(to_be_deleted)
Or even simpler:
to_be_deleted <- list.files(path, pattern = "fo")
file.remove(to_be_deleted)
A few notes, however:
You don't know in advance whether you have the right to delete these files.
You also don't know whether the names refer to regular files or to directories (or something else). It's tempting to believe that file.exists answers that second question, i.e. that it tells you whether a name is a real file, but it does not: file.exists(path) also returns TRUE when path is a directory. You can, however, detect directories with dir.exists(path). Depending on your specific case, it may or may not be necessary to check for this (for instance, if you know the pattern passed to grep only ever matches files, it's fine).
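If you do want to guard against accidentally removing a directory, a minimal sketch (keeping the "fo" pattern from the question) could look like this:
# full.names = TRUE so the paths are valid even outside the working directory
candidates <- list.files(path, pattern = "fo", full.names = TRUE)

# drop anything that is a directory rather than a regular file
to_be_deleted <- candidates[!dir.exists(candidates)]

# file.remove() returns one logical per file: TRUE where deletion succeeded
deleted <- file.remove(to_be_deleted)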

How can I read in Excel files by looking for a string pattern?

I need to read in a bunch of files that are scattered across different directories.
The problem is, these files all have slightly different naming variations such as:
7-2018 RECON.xlsx
RECON 06-2019.xlsx
5-31-2017 RECON LINKED.xlsx
I want to read in the Excel files whose names contain the keyword "RECON".
I tried using the contains helper inside the read_excel function - it didn't work.
Any ideas?
Thanks!
You could identify the list of relevant files and get their paths with something like:
> normalizePath(list.files(pattern="Rmd", ignore.case=TRUE, recursive=TRUE))
[1] "/Users/david/Dropbox (DaveArmstrong)/9590/Lecture1/Lecture1.Rmd"
[2] "/Users/david/Dropbox (DaveArmstrong)/9590/Lecture2/lec2_inclass.Rmd"
[3] "/Users/david/Dropbox (DaveArmstrong)/9590/Lecture2/lecture2.rmd"
[4] "/Users/david/Dropbox (DaveArmstrong)/9590/Lecture3/lecture3.rmd"
You would probably want a pattern like ".*RECON.*\\.xlsx$" which would find <anything>RECON<anything>.xlsx<end of string>. You could save the result as a vector of file names and then loop over them to read them in.
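For instance, a rough sketch of the full workflow, assuming the readxl package (since the question mentions read_excel) and that you want one data frame per file:
library(readxl)

# all .xlsx files containing "RECON", searched recursively from the current directory
paths <- normalizePath(list.files(pattern = ".*RECON.*\\.xlsx$",
                                  ignore.case = TRUE, recursive = TRUE))

# read each workbook (first sheet by default) into a named list of data frames
recon_data <- lapply(paths, read_excel)
names(recon_data) <- paths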

Sys.glob() within unzip()

TL;DR: How do I use Sys.glob() within unzip()?
I have multiple .zip files and I want to extract only one file from each archive.
For example, one of the archives contains the following files:
[1] "cmc-20150531.xml" "cmc-20150531.xsd" "cmc-20150531_cal.xml" "cmc-20150531_def.xml" "cmc-20150531_lab.xml"
[6] "cmc-20150531_pre.xml"
I want to extract the first file because it matches a pattern. In order to do that I use the following command:
unzip("zip-archive.zip", files=Sys.glob("[a-z][a-z][a-z][-][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][.][x][m][l]"))
However, the command doesn't work, and I don't know why. R just extracts all files in the archive.
On the other hand, the following command works:
unzip("zip-archive.zip", files="cmc-20150531.xml")
How do I use Sys.glob() within unzip()?
Sys.glob expands paths to files that already exist on disk, so the value passed to your unzip call depends on what happens to be in your working directory, not on the contents of the archive.
Perhaps you want to do unzip with list=TRUE to return the list of files in the zip first, and then use some pattern matching to select the files you want.
See ?grep for info on matching strings with patterns. These patterns are "regular expressions" rather than "glob" expansions, but you should be able to work with that.
Here's a concrete example:
# what's in the zip?
files <- unzip("c.zip", list = TRUE)$Name
files
[1] "l_spatial.dbf" "l_spatial.shp" "l_spatial.shx" "ls_polys_bin.dbf"
[5] "ls_polys_bin.shp" "ls_polys_bin.shx" "rast_jan90.tif"
# which files have "dbf" in them:
files[grepl("dbf", files)]
[1] "l_spatial.dbf" "ls_polys_bin.dbf"
# extract just those:
unzip("c.zip", files = files[grepl("dbf", files)])
The regular expression for your glob
"[a-z][a-z][a-z][-][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][.][x][m][l]"
would be
"^[a-z]{3}-[0-9]{8}\\.xml$"
that is: start of string ("^"), three lowercase letters a-z, a dash, eight digits, a literal dot (the backslashes are needed - one because a bare dot means "any single character" in a regular expression, and another because R string literals need a backslash to escape a backslash), "xml", and the end of the string ("$").
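Putting the two together for the archive in the question (the name zip-archive.zip is taken from there), a minimal sketch:
# list the archive's contents, keep only names matching the regex, extract those
contents <- unzip("zip-archive.zip", list = TRUE)$Name
wanted <- contents[grepl("^[a-z]{3}-[0-9]{8}\\.xml$", contents)]
unzip("zip-archive.zip", files = wanted)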
As with any other collection, you can loop over the results of Sys.glob and pass each element to unzip; a for loop does the job.
unzip() takes the path to the archive as its first argument, and files is an argument naming the files within that archive to extract.
Mind you, I'm more of a full-stack programmer than an R programmer, but the concepts are the same, so the code should look something like:
# Sys.glob() only matches files that already exist on disk
files <- Sys.glob(file.path(".", "*.zip"))
for (idx in seq_along(files)) {
  # note: the files argument of unzip() takes exact names, not wildcards,
  # so selecting the "*.xml" members needs the extra matching step shown below
  results <- unzip(files[idx])
}
As for using a regex in unzip() directly, that is something you should read the documentation about; I can only advise doing another for loop that compares the contents of each zip file against your regex and then performs the extraction. A rough R sketch follows:
files <- Sys.glob("*.zip")
regex <- "^[a-z]{3}-[0-9]{8}\\.xml$"
for (idx1 in seq_along(files)) {
  contents <- unzip(files[idx1], list = TRUE)$Name
  for (idx2 in seq_along(contents)) {
    if (grepl(regex, contents[idx2])) {
      # do something with the matching file, e.g. extract it
      unzip(files[idx1], files = contents[idx2])
    }
  }
}
Basically, you're just looping through your list of zip files, then through the list of files inside each zip file, comparing each filename against the regex, and extracting/performing operations on only the matching files.

R - how to find exact pattern when listing files

I have a number of files from which I would like to find only the ones that match an exact pattern.
When I run:
mods=c('GISS-E2-H','GISS-E2-R','GISS-E2-R-CC')
files <- list.files(idir, pattern=mods[1])
I got the results:
> files
[1] "clt_Amon_GISS-E2-H-CC_historical_r1i1p1_185001-190012.nc"
[2] "clt_Amon_GISS-E2-H-CC_historical_r1i1p1_190101-195012.nc"
[3] "clt_Amon_GISS-E2-H-CC_historical_r1i1p1_195101-201012.nc"
[4] "clt_Amon_GISS-E2-H_historical_r1i1p1_185001-190012.nc"
[5] "clt_Amon_GISS-E2-H_historical_r1i1p1_190101-195012.nc"
[6] "clt_Amon_GISS-E2-H_historical_r1i1p1_195101-200512.nc"
which is wrong, because I just wanted the last three names (which match the EXACT pattern I wish).
Even if I use regex to create the pattern, I will get a empty vector as result:
files <- list.files(idir, pattern=paste("^",m[1],"$", sep=''), full.names=T)
> files
character(0)
What am I missing here?
Thanks!
Your solution works; the first three files are returned because they also contain the pattern GISS-E2-H (as part of GISS-E2-H-CC).
To get only the last three, you can do as suggested by @G.Grothendieck and add the _ to mods:
mods=c('GISS-E2-H_','GISS-E2-R','GISS-E2-R-CC')
Now to test your solution I'll create the files:
allfiles <- c("clt_Amon_GISS-E2-H-CC_historical_r1i1p1_185001-190012.nc",
"clt_Amon_GISS-E2-H-CC_historical_r1i1p1_190101-195012.nc",
"clt_Amon_GISS-E2-H-CC_historical_r1i1p1_195101-201012.nc",
"clt_Amon_GISS-E2-H_historical_r1i1p1_185001-190012.nc",
"clt_Amon_GISS-E2-H_historical_r1i1p1_190101-195012.nc",
"clt_Amon_GISS-E2-H_historical_r1i1p1_195101-200512.nc")
for (file in allfiles) {
  write("empty file", file)
}
Now it works:
> list.files(getwd(), pattern=mods[1])
[1] "clt_Amon_GISS-E2-H_historical_r1i1p1_185001-190012.nc" "clt_Amon_GISS-E2-H_historical_r1i1p1_190101-195012.nc"
[3] "clt_Amon_GISS-E2-H_historical_r1i1p1_195101-200512.nc"
Edit:
An alternative is to keep mods as originally proposed and, instead of editing it, append the _ inside the call to list.files:
mods=c('GISS-E2-H','GISS-E2-R','GISS-E2-R-CC') #Original
list.files(getwd(), pattern=paste0(mods[1], "_"))
I would use this with caution, though. If you turn this into some kind of loop to also read the other file patterns in mods, the _ will be appended to all patterns, making them possibly incorrect.
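If you do loop over all of mods, one way around that caveat (a sketch that assumes the file-naming scheme shown above, where every model name is flanked by underscores) is to build each pattern from the surrounding underscores instead:
mods <- c('GISS-E2-H', 'GISS-E2-R', 'GISS-E2-R-CC')

# one vector of file names per model; "_GISS-E2-H_" no longer matches
# the GISS-E2-H-CC files, so no manual "_" suffix is needed
files_by_model <- lapply(mods, function(m) {
  list.files(getwd(), pattern = paste0("_", m, "_"))
})
names(files_by_model) <- mods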
Try this:
files <- list.files(idir, pattern = ".*GISS-E2-H_.*")
Your original vector of patterns was this:
mods=c('GISS-E2-H','GISS-E2-R','GISS-E2-R-CC')
which, once anchored with "^" and "$", was trying to match files whose entire name is exactly GISS-E2-H etc. Since no such files exist in your idir, you were getting back character(0).

choose file names in R satisfying some criteria

I store many files in a directory but I only need some of them. The files I need all contain transcript_counts in their names, so I am wondering whether R has a function that helps me pick out those file names. For example, by using dir() I can see a list of file names:
[1] "xx1_sequence_alignment.csv"
[2] "xx2_sequence_transcript_counts.csv"
[3] "xx3_sequence_alignment.csv"
[4] "xx4_sequence_transcript_counts.csv"
[5] "xx5_sequence_alignment.csv"
Now I want a list containing only xx2_sequence_transcript_counts.csv, xx4_sequence_transcript_counts.csv and so on, using transcript_counts as the identifier. Thanks.
Use the pattern argument
dir(pattern="transcript_counts")
From ?dir
pattern: an optional regular expression. Only file names which match the regular expression will be returned.
If you already have a character vector you can use grep to get the elements you want.
x <- dir()
grep("transcript_counts", x, value=TRUE)
