Choose file names in R satisfying some criteria

I store many files in a directory, but I only need some of them. The files I need all contain transcript_counts in their names, so I am wondering whether R has a function that can pick out those file names. For example, using dir() I see a list of file names:
[1] "xx1_sequence_alignment.csv"
[2] "xx2_sequence_transcript_counts.csv"
[3] "xx3_sequence_alignment.csv"
[4] "xx4_sequence_transcript_counts.csv"
[5] "xx5_sequence_alignment.csv"
Now I want to have a list containing only xx2_sequence_transcript_counts.csv, xx4_sequence_transcript_counts.csv and so on with transcript_counts as identifiers. Thanks.

Use the pattern argument
dir(pattern="transcript_counts")
From ?dir
pattern: an optional regular expression. Only file names which match the regular expression will be returned.
If you already have a character vector you can use grep to get the elements you want.
x <- dir()
grep("transcript_counts", x, value=TRUE)
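As a minimal sketch of the two approaches above, the following recreates the five example file names in a scratch directory and selects the transcript_counts files both ways. The directory name and variable names are made up for illustration; full.names is a standard list.files argument that returns paths ready to hand to a reader such as read.csv.

```r
# Hypothetical setup: recreate the example file names in a scratch directory
demo_dir <- file.path(tempdir(), "transcript_demo")
dir.create(demo_dir, showWarnings = FALSE)
example_names <- c("xx1_sequence_alignment.csv",
                   "xx2_sequence_transcript_counts.csv",
                   "xx3_sequence_alignment.csv",
                   "xx4_sequence_transcript_counts.csv",
                   "xx5_sequence_alignment.csv")
invisible(file.create(file.path(demo_dir, example_names)))

# Approach 1: let list.files() (dir() is an alias) filter by regular expression
count_files <- list.files(demo_dir, pattern = "transcript_counts")

# Approach 2: filter an existing character vector with grep()
all_files <- list.files(demo_dir)
count_files_grep <- grep("transcript_counts", all_files, value = TRUE)

# full.names = TRUE returns paths instead of bare file names
count_paths <- list.files(demo_dir, pattern = "transcript_counts",
                          full.names = TRUE)
```

Both approaches return the same two names here; the second is mainly useful when the vector of names already exists or comes from somewhere other than the file system.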

Related

How can I read in excel files by looking for a string pattern?

I need to read in a bunch of files that are scattered across different directories.
The problem is, these files all have slightly different naming variations such as:
7-2018 RECON.xlsx
RECON 06-2019.xlsx
5-31-2017 RECON LINKED.xlsx
I want to read in excel files to look for the keyword "RECON" in the file name.
I tried using the contains function in the read_excel function - didn't work.
Any ideas?
Thanks!
You could identify the list of relevant files and get their paths with something like:
> normalizePath(list.files(pattern="Rmd", ignore.case=TRUE, recursive=TRUE))
[1] "/Users/david/Dropbox (DaveArmstrong)/9590/Lecture1/Lecture1.Rmd"
[2] "/Users/david/Dropbox (DaveArmstrong)/9590/Lecture2/lec2_inclass.Rmd"
[3] "/Users/david/Dropbox (DaveArmstrong)/9590/Lecture2/lecture2.rmd"
[4] "/Users/david/Dropbox (DaveArmstrong)/9590/Lecture3/lecture3.rmd"
You would probably want a pattern like ".*RECON.*\\.xlsx$" which would find <anything>RECON<anything>.xlsx<end of string>. You could save the result as a vector of file names and then loop over them to read them in.
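A rough sketch of that idea, assuming the workbooks sit in subfolders of one root directory. The folder layout and the extra notes.xlsx file are invented for the demonstration, and the reading step is left as a comment because it assumes the readxl package is installed and the files are real workbooks.

```r
# Hypothetical layout: the example names spread over two subfolders
root <- file.path(tempdir(), "recon_demo")
dir.create(file.path(root, "2018"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "2019"), recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(root, c("2018/7-2018 RECON.xlsx",
                                        "2019/RECON 06-2019.xlsx",
                                        "2019/5-31-2017 RECON LINKED.xlsx",
                                        "2019/notes.xlsx"))))

# The pattern is matched against each file name; recursive = TRUE descends
# into subfolders and full.names = TRUE returns usable paths
recon_files <- list.files(root, pattern = "RECON.*\\.xlsx$",
                          recursive = TRUE, full.names = TRUE)

# Reading step (assumes readxl is installed and the files are real workbooks):
# sheets <- lapply(recon_files, readxl::read_excel)
# names(sheets) <- basename(recon_files)
```

Adding ignore.case = TRUE, as in the example above, would also pick up names such as "recon.xlsx".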

Is there an R function for extracting a number from a path name?

I have a list of files to read in from a folder. Some of them are in the csv format and some of them are in the pp format. I have two separate functions to read the files in based on which format it is.
Currently I have a list of the files from the list.files command. I'd like to make a dictionary mapping an ID to its corresponding file name; for example, 100 maps to /Users/Bob/Box/here/is/some/data/100.csv while 50 maps to /Users/Bob/Box/here/is/some/data/50.pp.
I'm looping through another set of IDs, so the purpose of the dictionary is to make it easier to look up the path name corresponding to an ID. Is there a way to construct this dictionary? I was thinking there might be a way to automatically extract the ID from the path name while processing the folder.
I found a solution here: https://stla.github.io/stlapblog/posts/Numextract.html
To extract a number from a character vector, you can use the stringr package and define a small extraction function:
> library(stringr)
> numextract <- function(string) str_extract(string, "[-+]?[0-9]*\\.?[0-9]+")
> numextract("30.5ml")
[1] "30.5"
# With a path like yours (the "-" right before the digits is read as a sign):
> numextract("/Users/Bob/Box/here/is/some/data/++---100.csv")
[1] "-100"
# Wrap the call in as.numeric() to get a numeric value instead of a string:
> as.numeric(numextract("/Users/Bob/Box/here/is/some/data/++---100.csv"))
[1] -100
Is this what you are looking for?
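For the dictionary itself, one possible base R sketch is a named character vector keyed by ID. It assumes each ID is simply the file name without its extension, as in the two example paths; path_by_id and reader_for are hypothetical names, and the strings returned by reader_for stand in for the asker's two real reading functions.

```r
# Example paths from the question; in practice these would come from
# list.files(folder, full.names = TRUE)
paths <- c("/Users/Bob/Box/here/is/some/data/100.csv",
           "/Users/Bob/Box/here/is/some/data/50.pp")

# Assumption: the ID is the file name without directory and extension
ids <- tools::file_path_sans_ext(basename(paths))

# A named character vector works as an ID -> path lookup table
path_by_id <- setNames(paths, ids)

# Hypothetical dispatcher: choose a reader based on the file extension
reader_for <- function(path) {
  switch(tools::file_ext(path),
         csv = "csv reader",
         pp  = "pp reader",
         stop("unknown format: ", path))
}

csv_path <- path_by_id[["100"]]
pp_reader <- reader_for(path_by_id[["50"]])
```

Note that the names are character strings, so a numeric ID has to be converted with as.character() before the lookup.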

R - how to find exact pattern when listing files

I have a number of files from which I would like to find only the ones that match an exact pattern.
When I run:
mods=c('GISS-E2-H','GISS-E2-R','GISS-E2-R-CC')
files <- list.files(idir, pattern=mods[1])
I got the results:
> files
[1] "clt_Amon_GISS-E2-H-CC_historical_r1i1p1_185001-190012.nc"
[2] "clt_Amon_GISS-E2-H-CC_historical_r1i1p1_190101-195012.nc"
[3] "clt_Amon_GISS-E2-H-CC_historical_r1i1p1_195101-201012.nc"
[4] "clt_Amon_GISS-E2-H_historical_r1i1p1_185001-190012.nc"
[5] "clt_Amon_GISS-E2-H_historical_r1i1p1_190101-195012.nc"
[6] "clt_Amon_GISS-E2-H_historical_r1i1p1_195101-200512.nc"
which is not what I want, because I only wanted the last three names (the ones that match the EXACT pattern).
Even if I use a regex to build the pattern, I get an empty vector as the result:
files <- list.files(idir, pattern=paste("^",mods[1],"$", sep=''), full.names=T)
> files
character(0)
What am I missing here?
Thanks!
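Your first call works as designed: pattern is a regular expression that may match anywhere in the file name, and the first three files (GISS-E2-H-CC) also contain GISS-E2-H.
To get only the last three, you can do as suggested by @G.Grothendieck and add the _ to the pattern in mods: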
mods=c('GISS-E2-H_','GISS-E2-R','GISS-E2-R-CC')
Now to test your solution I'll create the files:
allfiles <- c("clt_Amon_GISS-E2-H-CC_historical_r1i1p1_185001-190012.nc",
"clt_Amon_GISS-E2-H-CC_historical_r1i1p1_190101-195012.nc",
"clt_Amon_GISS-E2-H-CC_historical_r1i1p1_195101-201012.nc",
"clt_Amon_GISS-E2-H_historical_r1i1p1_185001-190012.nc",
"clt_Amon_GISS-E2-H_historical_r1i1p1_190101-195012.nc",
"clt_Amon_GISS-E2-H_historical_r1i1p1_195101-200512.nc")
for (file in allfiles) {
write("empty file", file)
}
Now it works:
> list.files(getwd(), pattern=mods[1])
[1] "clt_Amon_GISS-E2-H_historical_r1i1p1_185001-190012.nc" "clt_Amon_GISS-E2-H_historical_r1i1p1_190101-195012.nc"
[3] "clt_Amon_GISS-E2-H_historical_r1i1p1_195101-200512.nc"
Edit:
An alternative is to keep mods as originally defined and append the _ inside the list.files call:
mods=c('GISS-E2-H','GISS-E2-R','GISS-E2-R-CC') #Original
list.files(getwd(), pattern=paste0(mods[1], "_"))
I would use this with caution, though. If you turn this into some kind of loop to also read the other file patterns in mods, the _ will be appended to all patterns, making them possibly incorrect.
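One way to apply a boundary to every model at once, sketched here on an in-memory vector instead of the disk, is to wrap each name in the "_" separators on both sides. This assumes every model name in these file names is delimited by underscores, as in the six files listed above; files_by_model is a made-up name.

```r
allfiles <- c("clt_Amon_GISS-E2-H-CC_historical_r1i1p1_185001-190012.nc",
              "clt_Amon_GISS-E2-H-CC_historical_r1i1p1_190101-195012.nc",
              "clt_Amon_GISS-E2-H-CC_historical_r1i1p1_195101-201012.nc",
              "clt_Amon_GISS-E2-H_historical_r1i1p1_185001-190012.nc",
              "clt_Amon_GISS-E2-H_historical_r1i1p1_190101-195012.nc",
              "clt_Amon_GISS-E2-H_historical_r1i1p1_195101-200512.nc")
mods <- c("GISS-E2-H", "GISS-E2-R", "GISS-E2-R-CC")

# Surround each model name with "_" so that GISS-E2-H cannot match
# GISS-E2-H-CC; fixed = TRUE treats the pattern as a literal string
files_by_model <- lapply(setNames(mods, mods), function(m) {
  grep(paste0("_", m, "_"), allfiles, value = TRUE, fixed = TRUE)
})

h_files <- files_by_model[["GISS-E2-H"]]
```

With these six names, only the GISS-E2-H entry is non-empty; the other two models have no files in the example.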
Try this:
files <- list.files(idir, pattern = ".*GISS-E2-H_.*")
Your anchored pattern was built from this vector:
mods=c('GISS-E2-H','GISS-E2-R','GISS-E2-R-CC')
With ^ and $ around it, the pattern only matches files named exactly GISS-E2-H, and so on. Since no such files exist in your idir, you got back character(0).
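The effect of the anchors can be checked directly with grepl on a single name. This is only an illustration; the variable names are arbitrary.

```r
name <- "clt_Amon_GISS-E2-H_historical_r1i1p1_185001-190012.nc"

# Unanchored: the pattern may match anywhere inside the name
unanchored <- grepl("GISS-E2-H", name)

# Anchored with ^ and $: the whole name has to equal the pattern
anchored <- grepl("^GISS-E2-H$", name)

# The anchored pattern only matches a file literally named GISS-E2-H
exact_name <- grepl("^GISS-E2-H$", "GISS-E2-H")
```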

Using pattern to select files that has XX at any part the name in R

I have a folder full of files whose names look like these, for example:
[1] "./final_model_pre/pre_pe_ja_bc_wm.tif" "./final_model_pre/pre_pe_ja_bc_wm.tif.aux.xml"
[3] "./final_model_pre/pre_pe_ja_bc_wm.tif.ovr" "./final_model_pre/pre_an_le_glm_wm.tif"
[5] "./final_model_pre/pre_an_le_glm_wm.tif.aux.xml" "./final_model_pre/pre_an_le_glm_wm.tif.ovr"
[7] "./final_model_pre/pre_an_bo_ma_wm.tif" "./final_model_pre/pre_an_bo_ma_wm.tif.aux.xml"
[9] "./final_model_pre/pre_an_bo_ma_wm.tif.ovr" "./final_model_pre/pre_pe_ja_mx_wm.tif"
[11] "./final_model_pre/pre_pe_ja_mx_wm.tif.aux.xml" "./final_model_pre/pre_pe_ja_mx_wm.tif.ovr"
[13] "./final_model_pre/pre_pe_ja_rf1_wm.tif" "./final_model_pre/pre_pe_ja_rf1_wm.tif.aux.xml"
[15] "./final_model_pre/pre_pe_ja_rf1_wm.tif.ovr" "./final_model_pre/pre_pe_ja_svm_wm.tif"
[17] "./final_model_pre/pre_pe_ja_svm_wm.tif.aux.xml" "./final_model_pre/pre_pe_ja_svm_wm.tif.ovr"
I want to list every file that has "pe_ja" in its name and only the ".tif" extension, not ".tif.ovr" or ".tif.aux.xml" or any other extension. I'm trying to use the list.files function, but I couldn't manage to use the pattern argument properly. Could you help me do that?
Thank you!
P.S.: I'm using R.
You can use a regular expression for that.
files = list.files(pattern = '.*pe_ja.*\\.tif$')
The $ at the end of the regular expression indicates that that is the end of the string. The \\. is an escaped period, indicating that you want to match a period (not any character, which is what . normally matches).
The .* selects any character any number of times (including 0).
Try this one: files = list.files(pattern = '.*pe_ja.*\\.tif$'). Note that in a regex tester such as Regex101, where you can try the pattern against the file names, the escaped period is written \. with a single backslash.
Inside an R string literal the backslash itself has to be escaped, so the same regex is written '\\.'; writing '\.' in R code is rejected as an unrecognized escape.
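To try the pattern without touching the disk, it can be run with grep over a few of the paths from the question. The raw-string line is an optional extra and assumes R 4.0.0 or later, where r"(...)" literals let the regex be written with a single backslash.

```r
paths <- c("./final_model_pre/pre_pe_ja_bc_wm.tif",
           "./final_model_pre/pre_pe_ja_bc_wm.tif.aux.xml",
           "./final_model_pre/pre_pe_ja_bc_wm.tif.ovr",
           "./final_model_pre/pre_an_le_glm_wm.tif",
           "./final_model_pre/pre_pe_ja_mx_wm.tif",
           "./final_model_pre/pre_pe_ja_mx_wm.tif.ovr")

# "\\." in the R string is the regex \. (a literal period);
# "$" pins .tif to the end of the name
tif_only <- grep("pe_ja.*\\.tif$", paths, value = TRUE)

# Same regex as a raw string (assumes R >= 4.0.0): one backslash suffices
tif_only_raw <- grep(r"(pe_ja.*\.tif$)", paths, value = TRUE)
```

The leading .* from the original pattern is dropped here because an unanchored regex already matches anywhere in the string.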

Recognize arbitrary file extensions in R?

I'm writing a function in R that will take the path name of a folder as its argument and return a vector containing the names of all the files in that folder which have the extension ".pvalues".
myFunction <- function(path) {
# return vector that contains the names of all files
# in this folder that end in extension ".pvalues"
}
I know how to get the names of the files in the folder, like so:
> list.files("/Users/me/myfolder/")
[1] "myfile.txt"
[2] "myfile.txt.a"
[3] "myfile.txt.b"
[4] "myfile.txt.a.pvalues"
[5] "myfile.txt.b.pvalues"
Is there an easy way to identify all the files in this folder that end in ".pvalues"? I cannot assume that the names will start with "myfile". They could start with "yourfile", for instance.
Take a look at ?list.files. You want the pattern argument, which is a regular expression rather than a shell glob, so no leading * is needed: list.files(path='/Users/me/myfolder', pattern='\\.pvalues$')
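Wrapped into a function, this could look like the sketch below. list_pvalue_files is a hypothetical name standing in for myFunction, and the scratch directory exists only to exercise it.

```r
# Hypothetical helper: names of all files in `path` ending in ".pvalues"
list_pvalue_files <- function(path) {
  list.files(path = path, pattern = "\\.pvalues$")
}

# Scratch directory with a mix of names, including one not starting
# with "myfile"
demo_dir <- file.path(tempdir(), "pvalues_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, c("myfile.txt",
                                            "myfile.txt.a",
                                            "myfile.txt.a.pvalues",
                                            "yourfile.txt.b.pvalues"))))

found <- list_pvalue_files(demo_dir)
```

Because the pattern is anchored only at the end, the start of the file name does not matter.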
