R - how to find exact pattern when listing files - r

I have a number of files from which I would like to find only the ones that match an exact pattern.
When I run:
mods=c('GISS-E2-H','GISS-E2-R','GISS-E2-R-CC')
files <- list.files(idir, pattern=mods[1])
I got the results:
> files
[1] "clt_Amon_GISS-E2-H-CC_historical_r1i1p1_185001-190012.nc"
[2] "clt_Amon_GISS-E2-H-CC_historical_r1i1p1_190101-195012.nc"
[3] "clt_Amon_GISS-E2-H-CC_historical_r1i1p1_195101-201012.nc"
[4] "clt_Amon_GISS-E2-H_historical_r1i1p1_185001-190012.nc"
[5] "clt_Amon_GISS-E2-H_historical_r1i1p1_190101-195012.nc"
[6] "clt_Amon_GISS-E2-H_historical_r1i1p1_195101-200512.nc"
which is wrong, because I just wanted the last three names (which match the EXACT pattern I wish).
Even if I use regex to create the pattern, I get an empty vector as the result:
files <- list.files(idir, pattern=paste("^",mods[1],"$", sep=''), full.names=T)
> files
character(0)
What am I missing here?
Thanks!

Your code works as written: the first three files also contain the pattern GISS-E2-H.
To get only the last three, you can do as suggested by @G.Grothendieck and add the _ to mods:
mods=c('GISS-E2-H_','GISS-E2-R','GISS-E2-R-CC')
Now to test your solution I'll create the files:
allfiles <- c("clt_Amon_GISS-E2-H-CC_historical_r1i1p1_185001-190012.nc",
"clt_Amon_GISS-E2-H-CC_historical_r1i1p1_190101-195012.nc",
"clt_Amon_GISS-E2-H-CC_historical_r1i1p1_195101-201012.nc",
"clt_Amon_GISS-E2-H_historical_r1i1p1_185001-190012.nc",
"clt_Amon_GISS-E2-H_historical_r1i1p1_190101-195012.nc",
"clt_Amon_GISS-E2-H_historical_r1i1p1_195101-200512.nc")
for (file in allfiles) {
write("empty file", file)
}
Now it works:
> list.files(getwd(), pattern=mods[1])
[1] "clt_Amon_GISS-E2-H_historical_r1i1p1_185001-190012.nc" "clt_Amon_GISS-E2-H_historical_r1i1p1_190101-195012.nc"
[3] "clt_Amon_GISS-E2-H_historical_r1i1p1_195101-200512.nc"
Edit:
An alternative is as originally proposed, and instead of replacing mods you can append the _ inside list.files:
mods=c('GISS-E2-H','GISS-E2-R','GISS-E2-R-CC') #Original
list.files(getwd(), pattern=paste0(mods[1], "_"))
I would use this with caution, though. If you turn this into some kind of loop to also read the other file patterns in mods, the _ will be appended to all patterns, making them possibly incorrect.
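One way around that caution, sketched here on the file names from the question: each model name sits between two underscores, so wrapping every entry of mods in underscores should select exactly one model per pattern. The two GISS-E2-R file names below are hypothetical, and grep() with fixed = TRUE stands in for list.files(), which always treats pattern as a regular expression.

```r
# File names: the first two come from the question, the last two are
# hypothetical examples for the other models in mods.
allfiles <- c(
  "clt_Amon_GISS-E2-H-CC_historical_r1i1p1_185001-190012.nc",
  "clt_Amon_GISS-E2-H_historical_r1i1p1_185001-190012.nc",
  "clt_Amon_GISS-E2-R_historical_r1i1p1_185001-190012.nc",
  "clt_Amon_GISS-E2-R-CC_historical_r1i1p1_185001-190012.nc"
)
mods <- c("GISS-E2-H", "GISS-E2-R", "GISS-E2-R-CC")

# Delimit each model name with underscores on both sides, then match literally.
by_model <- lapply(mods, function(m) {
  grep(paste0("_", m, "_"), allfiles, value = TRUE, fixed = TRUE)
})
names(by_model) <- mods

by_model
```

These model names contain no regex metacharacters, so the same string can be passed to list.files() as pattern = paste0("_", m, "_").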

Try this:
files <- list.files(idir, pattern = ".*GISS-E2-H_.*")
Your original vector of patterns was this:
mods=c('GISS-E2-H','GISS-E2-R','GISS-E2-R-CC')
Wrapped in ^ and $, each of these had to match a whole file name, i.e. a file called exactly GISS-E2-H etc. Since those files do not exist in your idir, you were getting back character(0).
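A small check with grepl() on names from the question illustrates the difference between an unanchored pattern, a fully anchored one, and one delimited by underscores. The variable names are only for illustration.

```r
h_name  <- "clt_Amon_GISS-E2-H_historical_r1i1p1_185001-190012.nc"
cc_name <- "clt_Amon_GISS-E2-H-CC_historical_r1i1p1_185001-190012.nc"

# Unanchored: matches anywhere in the name, so both files match.
unanchored <- grepl("GISS-E2-H", c(h_name, cc_name))

# Anchored with ^ and $: the whole name would have to be "GISS-E2-H".
anchored <- grepl("^GISS-E2-H$", c(h_name, cc_name))

# Delimited by underscores: only the GISS-E2-H file matches.
delimited <- grepl("_GISS-E2-H_", c(h_name, cc_name))

unanchored  # TRUE TRUE
anchored    # FALSE FALSE
delimited   # TRUE FALSE
```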

Related

How can I read in excel files by looking for a string pattern?

I need to read in a bunch of files that are scattered across different directories.
The problem is, these files all have slightly different naming variations such as:
7-2018 RECON.xlsx
RECON 06-2019.xlsx
5-31-2017 RECON LINKED.xlsx
I want to read in only the Excel files that have the keyword "RECON" in the file name.
I tried using the contains function in the read_excel function - didn't work.
Any ideas?
Thanks!
You could identify the list of relevant files and get their paths with something like:
> normalizePath(list.files(pattern="Rmd", ignore.case=TRUE, recursive=TRUE))
[1] "/Users/david/Dropbox (DaveArmstrong)/9590/Lecture1/Lecture1.Rmd"
[2] "/Users/david/Dropbox (DaveArmstrong)/9590/Lecture2/lec2_inclass.Rmd"
[3] "/Users/david/Dropbox (DaveArmstrong)/9590/Lecture2/lecture2.rmd"
[4] "/Users/david/Dropbox (DaveArmstrong)/9590/Lecture3/lecture3.rmd"
You would probably want a pattern like ".*RECON.*\\.xlsx$" which would find <anything>RECON<anything>.xlsx<end of string>. You could save the result as a vector of file names and then loop over them to read them in.
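As a rough sketch of that pattern, here it is applied with grep() to the three names from the question plus two hypothetical names that should be rejected. The commented-out lines assume the readxl package is installed and that the files sit somewhere under the working directory.

```r
# First three names are from the question; the last two are hypothetical.
candidates <- c(
  "7-2018 RECON.xlsx",
  "RECON 06-2019.xlsx",
  "5-31-2017 RECON LINKED.xlsx",
  "RECON summary.csv",
  "budget 2019.xlsx"
)

# Keep names that contain RECON and end in .xlsx.
recon_files <- grep(".*RECON.*\\.xlsx$", candidates, value = TRUE)
recon_files

# With real files, the same pattern goes to list.files(); reading them
# then becomes a loop over the resulting paths (assumes readxl):
# paths  <- list.files(pattern = ".*RECON.*\\.xlsx$", recursive = TRUE,
#                      full.names = TRUE)
# sheets <- lapply(paths, readxl::read_excel)
```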

How to rename part of a file

I would like to rename part of a file name, because the structure is hardcoded in getfiles.
I have metabolomics mzML files containing ltQCs, sQCs and samples, but the file names have different lengths (6, 6 and 7 underscore-separated parts). I am trying to run XCMS, but it only picks up ltQCs and sQCs, because the structure is hardcoded to 6. How do I change the structure of the file name? See the example below:
2020-02-02_B1W1_RP_NEG_P7_A20_001.mzML (structure of 7)
to
2020-02-02_B1W1_RP_NEG_P7A20_001.mzML (structure of 6)
I have highlighted the part that I would like to change. If this is impossible, it may be easier to rename the ltQCs and sQCs by adding a letter or number, so that I get a structure of 7, and then change the structure in getfiles to 7.
Hope somebody can help, thank you:)
Best
You can change the file names with a regular expression, using gsub to remove the penultimate underscore:
my_regex <- "(_)([[:alnum:]]{3}_[[:alnum:]]{3}[.]mzML)"
my_filename <- "2020-02-02_B1W1_RP_NEG_P7_A20_001.mzML"
gsub(my_regex, "\\2", my_filename)
#> [1] "2020-02-02_B1W1_RP_NEG_P7A20_001.mzML"
So you could do something like
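rename_mzMLs <- function(directory)
{
  # full.names = TRUE so file.rename() works from any working directory
  filenames <- list.files(directory, pattern = "\\.mzML$", full.names = TRUE)
  my_regex <- "(_)([[:alnum:]]{3}_[[:alnum:]]{3}[.]mzML)"
  # drop the penultimate underscore; names that do not match stay unchanged
  new_filenames <- gsub(my_regex, "\\2", filenames)
  file.rename(filenames, new_filenames)
}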
And run it by doing
rename_mzMLs("C:/path/to/mzML/files/")
Obviously, I can't test this since I don't have any mzML files, so ensure you back up your files before running this function!
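One caveat with the regex is that it also fires on any 6-part name whose fifth part happens to be three alphanumeric characters long. A more conservative alternative, sketched below, counts the underscore-separated parts and only merges the fifth and sixth when there are exactly seven. The function name merge_fields and the two QC file names are hypothetical; only the sample name comes from the question.

```r
# Merge parts 5 and 6 of a name that has exactly 7 underscore-separated
# parts; leave every other name untouched.
merge_fields <- function(name) {
  parts <- strsplit(name, "_", fixed = TRUE)[[1]]
  if (length(parts) != 7) {
    return(name)
  }
  paste(c(parts[1:4], paste0(parts[5], parts[6]), parts[7]), collapse = "_")
}

old_names <- c(
  "2020-02-02_B1W1_RP_NEG_P7_A20_001.mzML",  # sample name from the question
  "2020-02-02_B1W1_RP_NEG_sQC_001.mzML",     # hypothetical sQC name
  "2020-02-02_B1W1_RP_NEG_ltQC_001.mzML"     # hypothetical ltQC name
)
new_names <- vapply(old_names, merge_fields, character(1), USE.NAMES = FALSE)

# Preview the mapping before renaming anything on disk.
data.frame(old = old_names, new = new_names)
```

Printing the old and new names side by side before calling file.rename() is a cheap way to catch surprises.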

Extract segment of filename

I'm trying to extract a filename and save the dataframe with that same name.
The problem I have is that if the filename for some reason is inside a folder with a similar word, stringr will return that word as well.
filename <- "~folder/testdata/2016/testdata 2016.csv"
If I run this:
library(stringr)
str <- str_trim(stringr::str_extract(filename,"[t](.*)"), "left")
it returns testdata/2016/testdata 2016.csv when all I want is testdata 2016. Ideally, I would get testdata2016.
I've been trying several combinations but there has to be a simpler way of doing this. If there was a way of reading the path from right to left, starting at .csv stop at /, I wouldn't have this issue.
You can use either of the approaches below:
library(stringr)
str_replace(str_extract(filename,"\\w*\\s+\\w*(?=\\.)"),"\\s+","")
str_replace_all(basename(filename),"\\s+|\\.csv","")
You can use the basename approach as suggested by Benjamin.
?basename:
basename removes all of the path up to and including the last path
separator (if any).
Output:
[1] "testdata2016"
Plenty of help in base R (tools pkg comes with the default R install):
gsub(" ", "",
tools::file_path_sans_ext(
basename("~folder/testdata/2016/testdata 2016.csv")))

How to modify i in an R loop?

I have several large R objects saved as .RData files: "this.RData", "that.RData", "andTheOther.RData" and so on. I don't have enough memory, so I want to load each in a loop, extract some rows, and unload it. However, once I load(i), I need to strip the ".RData" part of (i) before I can do anything with objects "this", "that", "andTheOther". I want to do the opposite of what is described in How to iterate over file names in a R script? How can I do that? Thx
Edit: I omitted to mention the files are not in the working directory and have a filepath as well. I came across Getting filename without extension in R and file_path_sans_ext takes out the extension but the rest of the path is still there.
Do you mean something like this?
i <- c("/path/to/this.RData", "/another/path/to/that.RData")
# keep only what follows the last "/"
f <- gsub(".*/([^/]+)", "\\1", i)
# strip the extension
f1 <- gsub("\\.RData$", "", f)
f1
[1] "this" "that"
For Windows paths written with backslashes, you have to match "\\\\" (a regex-escaped backslash) instead of "/".
Edit: Explanation. Technically, these are called "regular expressions" (regexps), not "patterns".
. any character
.* arbitrary number (including 0) of any kind of characters
.*/ arbitrary number of any kind of characters, followed by a /
[^/] any character but not /
[^/]+ arbitrary number (1 or more) of any kind of characters, but not /
( and ) enclose groups. You can use the groups when replacing as \\1, \\2 etc.
So, look for any kind of character, followed by /, followed by anything but not the path separator. Replace this with the "anything but not separator".
There are many good tutorials for regexps; just search for one.
A simple way to do this would be to extract the base name from the file paths with base::basename() and then remove the file extension with tools::file_path_sans_ext().
paths_to_files <- c("./path/to/this.RData", "./another/path/to/that.RData")
tools::file_path_sans_ext(
basename(
paths_to_files
)
)
## Returns:
## [1] "this" "that"
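Putting this together with the loop from the question, a minimal sketch might look like the following. It assumes, as the question describes, that each .RData file holds one object named after the file. The demo data frames and the temporary directory are made up for illustration, and head(, 2) stands in for whatever row extraction is actually needed.

```r
# Set up two small demo files in a temporary directory (illustrative data).
demo_dir <- tempfile("rdata_demo")
dir.create(demo_dir)
this <- data.frame(id = 1:5, value = letters[1:5])
that <- data.frame(id = 6:10, value = letters[6:10])
save(this, file = file.path(demo_dir, "this.RData"))
save(that, file = file.path(demo_dir, "that.RData"))
rm(this, that)

paths <- list.files(demo_dir, pattern = "\\.RData$", full.names = TRUE)
extracted <- list()

for (p in paths) {
  # Object name = file name without directory and extension.
  obj_name <- tools::file_path_sans_ext(basename(p))

  # Load into a throwaway environment so the large object is easy to drop.
  env <- new.env()
  load(p, envir = env)

  # Keep only the rows of interest, then discard the environment.
  extracted[[obj_name]] <- head(get(obj_name, envir = env), 2)
  rm(env)
}

names(extracted)
```

Loading into a separate environment keeps the global workspace clean and means one rm() is enough to make the large object eligible for garbage collection.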

choose file names in R satisfying some criteria

I store many files in a directory but I only need some of them. The files I need all contain transcript_counts, so I am wondering whether R has a function that helps me pick out the file names containing transcript_counts. For example, by using dir() I can see a list of file names:
[1] "xx1_sequence_alignment.csv"
[2] "xx2_sequence_transcript_counts.csv"
[3] "xx3_sequence_alignment.csv"
[4] "xx4_sequence_transcript_counts.csv"
[5] "xx5_sequence_alignment.csv"
Now I want to have a list containing only xx2_sequence_transcript_counts.csv, xx4_sequence_transcript_counts.csv and so on with transcript_counts as identifiers. Thanks.
Use the pattern argument
dir(pattern="transcript_counts")
From ?dir
pattern: an optional regular expression. Only file names which match the regular expression will be returned.
If you already have a character vector you can use grep to get the elements you want.
x <- dir()
grep("transcript_counts", x, value=TRUE)
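For illustration, here is the grep() variant run on the five names listed in the question, with the pattern tightened to also require the .csv extension (an optional refinement, not something the question asks for), plus the complementary selection via grepl().

```r
# File names as listed in the question.
x <- c(
  "xx1_sequence_alignment.csv",
  "xx2_sequence_transcript_counts.csv",
  "xx3_sequence_alignment.csv",
  "xx4_sequence_transcript_counts.csv",
  "xx5_sequence_alignment.csv"
)

# Names that end in transcript_counts.csv.
wanted <- grep("transcript_counts\\.csv$", x, value = TRUE)

# Everything else, using a logical index instead of value = TRUE.
others <- x[!grepl("transcript_counts", x, fixed = TRUE)]

wanted
others
```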
