List files starting with a specific character - r

I have a folder which contains files with the following names: "5_45name.Rdata" and "15_45name.Rdata".
I would like to list only the ones starting with "5", in the example above this means that I would like to exclude "15_45name.Rdata".
Using list.files(pattern="5_*name.Rdata"), however, will list both of these files.
Is there a way to tell list.files() that I want the filename to start with a specific character?

We need to use the metacharacter ^ to anchor the match to the start of the string, followed by the number 5. A more specific pattern looks like this:
list.files(pattern ="^5[0-9]*_[0-9]*name.Rdata")
Or, more concisely, if we are not concerned about the _ and the other digits following it:
list.files(pattern = "^5.*name.Rdata")
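To see why the anchor matters, the patterns can be tried on a plain character vector with grepl(), which uses the same regular-expression syntax as the pattern argument of list.files(). This is only a sketch: the vector fnames is made up for illustration and holds just the two names from the question.

```r
# Hypothetical file names, copied from the question
fnames <- c("5_45name.Rdata", "15_45name.Rdata")

# Unanchored: the regex may match anywhere in the string, and "_*" means
# "zero or more underscores", so the "5name.Rdata" part of both names matches
unanchored <- grepl("5_*name.Rdata", fnames)

# Anchored: "^5" requires the name to begin with 5
anchored_specific <- grepl("^5[0-9]*_[0-9]*name.Rdata", fnames)
anchored_concise  <- grepl("^5.*name.Rdata", fnames)

print(data.frame(fnames, unanchored, anchored_specific, anchored_concise))
```

Only the first name should come out TRUE in the two anchored columns.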

Related

Renaming multiple files by keeping first 6 characters

How to rename file name from Pbs_d7_s2_juliam_08July2020_02_1_0_live singlets.fcs to Pbs_d7_s2.fcs
For multiple files keeping in mind that _juliam_08July2020_02_1_0_live singlets is not the same for all files?
It's a bit unclear what you're asking for, but it looks like you only want to keep the chunks before the third underscore. If so, you can tackle this with regular expressions. The regular expression
str_extract(input_string, "^(([^_])+_){3}")
will take out the first 3 blocks of characters (that aren't underscores) that end in underscores. The first ^ "anchors" the match to the beginning of the string, the "[^_]+_" matches one or more non-underscore characters followed by an underscore. The {3} repeats the preceding group 3 times.
So for "Pbs_d7_s2_juliam_08July2020_02_1_0_live singlets.fcs" you'll end up with "Pbs_d7_s2_". Now you just replace the last underscore with ".fcs" like so
str_replace(modified_string, "_$", ".fcs")
The $ "anchors" the characters that precede it to the end of the string so in this case it's replacing the last underscore. The full sequence is
library(stringr)  # provides str_extract(), str_replace() and the %>% pipe
string1 <- "Pbs_d7_s2_juliam_08July2020_02_1_0_live singlets.fcs"
str_extract(string1, "^(([^_])+_){3}") %>%
  str_replace("_$", ".fcs")
[1] "Pbs_d7_s2.fcs"
Now let's assume your filenames are in a vector named stringvec.
output <- vector("character", length(stringvec))
for (i in seq_along(stringvec)) {
  output[[i]] <- str_extract(stringvec[[i]], "^(([^_])+_){3}") %>%
    str_replace("_$", ".fcs")
}
output
I'm making some assumptions here - namely that the naming convention is the same for all of your files. If that's not true you'll need to find ways to modify the regex search pattern.
I recommend the answer to "How do I rename files using R?" for replacing vectors of file names. If you have a vector of original file names you can use my for loop to generate a vector of new names, and then you can use the information in that answer to replace one with the other. Perhaps there are other solutions not involving for loops.
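As one possible loop-free variant, here is a minimal sketch that swaps in base R's sub() for the stringr calls and then renames with file.rename(), which accepts vectors. The directory tmp and the second file name are invented for the demo, and the sketch assumes every name has at least three underscores.

```r
# Work in a throwaway directory so no real files are touched
tmp <- file.path(tempdir(), "rename_demo")
dir.create(tmp, showWarnings = FALSE)

# Hypothetical file names following the convention in the question
old_names <- c("Pbs_d7_s2_juliam_08July2020_02_1_0_live singlets.fcs",
               "Pbs_d9_s4_annab_09July2020_01_2_0_live singlets.fcs")
invisible(file.create(file.path(tmp, old_names)))

# Capture the first three underscore-separated chunks, drop the rest,
# and append ".fcs"; sub() works on the whole vector at once
new_names <- sub("^(([^_]+_){2}[^_]+)_.*$", "\\1.fcs", old_names)

# file.rename() pairs old and new paths element by element
renamed <- file.rename(file.path(tmp, old_names), file.path(tmp, new_names))
print(new_names)
```

It is worth inspecting new_names before calling file.rename(), since a name that does not fit the pattern is returned unchanged by sub().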

In R, read files from folder in a list and assign list element names by the file names w/o file format (.fa)

I'm making a list of fasta files and reading them from a folder. The file name should be assigned as the list element name without the .fa file extension.
I'm using list.files to access the files in the directory "Folder"
filenames <- list.files("Folder",pattern = ".fa",full.names = T)
and then read the fasta files in.
list <- lapply(filenames, FUN=readDNAStringSet, use.names=T, format="fasta")
I found this code using setNames to define the list element name.
list<- setNames(list, substr(list.files("Folder", pattern=".fa"), 1,15 ))
But my file names have different length (makes it difficult to use the START to STOP (,1, 15)) and for further processing I would like to get rid of the .fa
The files would look like:
Gene1.fa
Gene12.fa
Gene22a.fa
Gene123abc.fa
I'm using DECIPHER but I guess this is more of a base R question?
In order to remove the substring at the end, we could use substring as well, but the last position has to be computed from the end of each name rather than from the beginning, because the length varies
v1 <- list.files("Folder", pattern = "\\.fa$")
substring(v1, first = 1, last = nchar(v1) -3)
#[1] "Gene1" "Gene12" "Gene22a" "Gene123abc"
Another option is sub, matching the dot followed by 'fa' at the end ($) of the string and replacing it with an empty string (""). The dot is a metacharacter that matches any character, so it is escaped (\\) to get its literal meaning
sub("\\.fa$", "", v1)
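Putting the pieces together, a minimal sketch of naming the list elements after the files could look like the following. It is not the DECIPHER workflow itself: readLines() stands in for readDNAStringSet() so the example runs without Bioconductor, the demo directory and file contents are made up, and tools::file_path_sans_ext() is used as one more way of dropping the extension.

```r
# Create a few small demo files in a temporary folder (hypothetical content)
tmp <- file.path(tempdir(), "fasta_demo")
dir.create(tmp, showWarnings = FALSE)
demo_files <- c("Gene1.fa", "Gene12.fa", "Gene22a.fa", "Gene123abc.fa")
for (f in demo_files) writeLines(c(">seq", "ACGT"), file.path(tmp, f))

# List once, with full paths, so reading and naming use the same vector
filenames <- list.files(tmp, pattern = "\\.fa$", full.names = TRUE)

# Stand-in reader; in the question this would be readDNAStringSet
seqs <- lapply(filenames, readLines)

# basename() drops the directory, file_path_sans_ext() drops ".fa"
names(seqs) <- tools::file_path_sans_ext(basename(filenames))
print(names(seqs))
```

Calling list.files() only once avoids any mismatch between the files that were read and the names assigned to them.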

Reading multiple csv files from a folder with R using regex

I wish to use R to read multiple csv files from a single folder. If I wanted to read every csv file I could use:
list.files(folder, pattern="*.csv")
See, for example, these questions:
Reading multiple csv files from a folder into a single dataframe in R
Importing multiple .csv files into R
However, I only wish to read one of four subsets of the files at a time. Below is an example grouping of four files each for three models.
JS.N_Nov6_2017_model220_N200.csv
JS.N_Nov6_2017_model221_N200.csv
JS.N_Nov6_2017_model222_N200.csv
my.IDs.alt_Nov6_2017_model220_N200.csv
my.IDs.alt_Nov6_2017_model221_N200.csv
my.IDs.alt_Nov6_2017_model222_N200.csv
parms_Nov6_2017_model220_N200.csv
parms_Nov6_2017_model221_N200.csv
parms_Nov6_2017_model222_N200.csv
supN_Nov6_2017_model220_N200.csv
supN_Nov6_2017_model221_N200.csv
supN_Nov6_2017_model222_N200.csv
If I only wish to read, for example, the parms files I try the following, which does not work:
list.files(folder, pattern="parm*.csv")
I am assuming that I may need to use regex to read a given group of the four groups present, but I do not know.
How can I read each of the four groups separately?
EDIT
I am unsure whether I would have been able to obtain the solution from answers to this question:
Listing all files matching a full-path pattern in R
I may have had to spend a fair bit of time brushing up on regex to apply those answers to my problem. The answer provided below by Mako212 is outstanding.
A quick REGEX 101 explanation:
For the case of matching the beginning and end of the string, which is all you need to do here, the following principles apply to match files that are .csv and start with parm:
list.files(folder, pattern="^parm.*?\\.csv")
^ asserts we're at the beginning of the string, so ^parm means match parm, but only if it's at the beginning of the string.
.*? means match anything up until the next part of the pattern matches. In this case, match until we see a period \\.
. means match any character in REGEX, so we need to escape it with \\ to match the literal . (note that in R you need the double escape \\, in other languages a single escape \ is sufficient).
Finally csv means match csv after the literal dot. If we were going to be really thorough, we might use \\.csv$, using the $ to indicate the end of the string. You'd need the dollar sign if you had other files with an extension like .csv2. \\.csv would match .csv2, whereas \\.csv$ would not.
In your case, you could simply replace parm in the REGEX pattern with JS, my, or supN to select one of your other file types.
Finally, if you wanted to match a subset of your total file list, you could use the | logical "or" operator:
list.files(folder, pattern = "^(parm|JS|supN).*?\\.csv")
This would return all the file names except the ones that start with my.
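The same idea can be sketched for all four groups at once by building one anchored pattern per prefix. The patterns are tried here on a character vector with grep() instead of on a real folder, and the prefixes vector is an assumption based on the file names in the question (the prefixes contain no regex metacharacters, so they can be pasted into the pattern as they are).

```r
# File names copied from the question
fnames <- c(
  "JS.N_Nov6_2017_model220_N200.csv", "JS.N_Nov6_2017_model221_N200.csv",
  "JS.N_Nov6_2017_model222_N200.csv", "my.IDs.alt_Nov6_2017_model220_N200.csv",
  "my.IDs.alt_Nov6_2017_model221_N200.csv", "my.IDs.alt_Nov6_2017_model222_N200.csv",
  "parms_Nov6_2017_model220_N200.csv", "parms_Nov6_2017_model221_N200.csv",
  "parms_Nov6_2017_model222_N200.csv", "supN_Nov6_2017_model220_N200.csv",
  "supN_Nov6_2017_model221_N200.csv", "supN_Nov6_2017_model222_N200.csv"
)

# One prefix per group (assumed from the names above)
prefixes <- c("JS", "my", "parms", "supN")

# Build "^<prefix>.*\\.csv$" for each prefix and keep the matching names
groups <- lapply(prefixes, function(p) {
  grep(paste0("^", p, ".*\\.csv$"), fnames, value = TRUE)
})
names(groups) <- prefixes
print(lengths(groups))
```

With a real folder, the same pattern string would go into list.files(folder, pattern = ...) in place of the grep() call.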
The list.files statement shown in the question is using globs, but list.files accepts regular expressions, not globs.
Sys.glob: To use globs, use Sys.glob like this:
olddir <- setwd(folder)
parm <- lapply(Sys.glob("parm*.csv"), read.csv)
parm is now a list of data frames read in from those files.
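glob2rx: Note that the glob2rx function can be used to convert globs to regular expressions (full.names = TRUE is needed so that read.csv receives paths it can open when folder is not the working directory):
parm <- lapply(list.files(folder, pattern = glob2rx("parm*.csv"), full.names = TRUE), read.csv)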
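To illustrate the difference between the glob and its regex translation, here is a small sketch on a character vector. The three file names are assumptions for the demo (the last one has a made-up .csv2 extension), and grepl() is used only because it interprets patterns the same way list.files() does.

```r
# Translate the glob into a regular expression
rx <- utils::glob2rx("parm*.csv")
print(rx)

# Hypothetical names: one wanted file, one from another group, one .csv2 file
fnames <- c("parms_Nov6_2017_model220_N200.csv",
            "supN_Nov6_2017_model220_N200.csv",
            "parms_backup.csv2")

# Using the translated pattern: behaves like the glob
as_glob <- grepl(rx, fnames)

# Using the glob text directly as a regex: "m*" means "zero or more m",
# and "." means "any single character", so nothing here matches
as_regex <- grepl("parm*.csv", fnames)

print(data.frame(fnames, as_glob, as_regex))
```

This is why the attempt in the question returns nothing: read as a regex, "parm*.csv" requires csv to follow almost immediately after par.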

Subset according to patterns in file names

I have the following file names in a folder:
1_myfile.txt, 2_myfile.txt, 3_myfile.txt, and 4_best_myfile.txt, 5_best_myfile.txt, 6_best_myfile.txt.
I would like to use regex in pattern = "" when listing files with list.files() in order to subset files containing "_myfile.txt" from files containing "_best_myfile.txt". I tried using:
files = list.files(path = ".", "*[^best_myfile.txt]$")
Unfortunately it does not work because it subsets only files that do not end with .txt. How can I solve this?
We can modify the pattern to "\\d+_best_myfile\\.txt" and pass it through the pattern argument (the first argument of list.files is the path, not the pattern)
files <- list.files(pattern = "\\d+_best_myfile\\.txt")
It implies one or more digits (\\d+) followed by a _ and the string best_myfile.txt. Also, note that some characters need to be escaped, i.e. . is a metacharacter that matches any character. So, to get the literal dot character, we need to escape (\\) it
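A short sketch of separating the two groups, tried on a character vector with grep(). The anchors ^ and $ are an addition of mine to make the match exact, and the pattern for the plain "_myfile.txt" files is an assumption about what the asker wants for the other half.

```r
# File names copied from the question
fnames <- c("1_myfile.txt", "2_myfile.txt", "3_myfile.txt",
            "4_best_myfile.txt", "5_best_myfile.txt", "6_best_myfile.txt")

# Digits, then "_best_myfile.txt", and nothing else
best <- grep("^\\d+_best_myfile\\.txt$", fnames, value = TRUE)

# Digits, then "_myfile.txt" directly: names with "_best" in between cannot match
plain <- grep("^\\d+_myfile\\.txt$", fnames, value = TRUE)

# The attempt from the question: [^...] is a negated character class, i.e.
# "one character that is none of b, e, s, t, _, m, y, f, i, l, ., x";
# every name ends in "t", so nothing matches
attempt <- grep("*[^best_myfile.txt]$", fnames, value = TRUE)

print(best)
print(plain)
```

Square brackets list individual characters rather than a word, which is why the original pattern only keeps names whose last character is outside that set.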

Using R to list all files with a specified extension

I'm very new to R and am working on updating an R script to iterate through a series of .dbf tables created using ArcGIS and produce a series of graphs.
I have a directory, C:\Scratch, that will contain all of my .dbf files. However, when ArcGIS creates these tables, it also includes a .dbf.xml file. I want to remove these .dbf.xml files from my file list and thus my iteration. I've tried searching and experimenting with regular expressions to no avail. This is the basic expression I'm using (Excluding all of the various experimentation):
files <- list.files(pattern = "dbf")
Can anyone give me some direction?
files <- list.files(pattern = "\\.dbf$")
$ at the end means that this is the end of the string. "dbf$" will work too, but adding \\. (. is a special character in regular expressions, so you need to escape it) ensures that you match only files with the extension .dbf (in case you have e.g. .adbf files).
Try this which uses globs rather than regular expressions so it will only pick out the file names that end in .dbf
filenames <- Sys.glob("*.dbf")
Peg the pattern to find "\\.dbf" at the end of the string using the $ character:
list.files(pattern = "\\.dbf$")
Gives you the list of files with full path:
Sys.glob(file.path(file_dir, "*.dbf")) ## file_dir = directory containing the files
I am not very good at using sophisticated regular expressions, so I'd do such a task in the following way:
files <- list.files()
dbf.files <- files[!grepl(".xml", files, fixed = TRUE)]
The first line just lists all files in the working directory. The second one drops everything containing ".xml": grepl returns a logical vector marking such strings in the 'files' vector, and subsetting with its negation removes the corresponding entries. (Subsetting with negative indices from grep would also work, but it returns an empty vector when nothing matches.)
The "fixed" argument to grepl is just my whim, as I usually want it to perform crude pattern matching without Perl-style fancy regexps, which may surprise me.
I'm aware that such solution simply reflects drawbacks in my education, but for a novice it may be useful =) at least it's easy.
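For comparison, the patterns discussed in these answers can be tried side by side on a character vector. The file names below are invented for illustration (including an .adbf file to show the effect of the escaped dot); this is a sketch, not output from a real C:\Scratch directory.

```r
# Hypothetical directory listing
fnames <- c("roads.dbf", "roads.dbf.xml", "parcels.dbf", "notes.adbf")

# Original attempt: "dbf" anywhere in the name, so everything matches
anywhere <- grep("dbf", fnames, value = TRUE)

# Anchored at the end, but without the dot: still lets .adbf through
end_only <- grep("dbf$", fnames, value = TRUE)

# Literal dot plus end anchor: only true .dbf extensions
strict <- grep("\\.dbf$", fnames, value = TRUE)

print(list(anywhere = anywhere, end_only = end_only, strict = strict))
```

In list.files() the strict pattern goes in as pattern = "\\.dbf$".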
