File renaming (string substitution) without a clear pattern using R - r

Currently, I am working with a long list of files.
They have a name pattern of SB_xxx_(parts). (different extensions), where xxx refers to an item code.
SB_19842.png
SB_19842_head.png
SB_19842_hand.png
SB_19842_head.pdf
...
It is found that many of these codes have incorrect entries.
I got two columns in hand: One is for old codes and one is new codes (let's say A & B). I hope to change all those old codes in the file names to the new code.
old new
12154 24124
92482 02425
.....
My first thought is to use file.rename()
However, it is a one-to-one changing approach. I cannot do this because every item has a different number of parts and different file extensions.
Is there any recursive method that can simply change all incorrect file names with strings in A and replace them with strings in B? Anyone get an idea, please?

A loop solution with purrr::map2 at the end:
library(purrr)
#create files to rename
file.create("SB_19842.png")
file.create("SB_19842_head.png")
file.create("SB_19842_hand.png")
file.create("SB_19842_head.pdf")
file.create("SB_12154.png")
file.create("SB_12154_head.png")
file.create("SB_12154_hand.png")
file.create("SB_12154_head.pdf")
# a dataframe with old a nd new patterns
file_names <- data.frame(
old = c("19842", "12154"),
new = c("new1", "new2")
)
# old filenames from the directory, specify path if needed
file_names_SB <- list.files(pattern = "SB_")
# function to substitute one type of code with another
sub_one_code <- function(old_code, new_code, file_names_original){
gsub(paste0("SB_", old_code), paste0("SB_", new_code), file_names_original)
}
# loop to substitute all codes
new_file_names <- file_names_SB
for (row in 1:nrow(file_names)){
new_file_names <- sub_one_code(file_names[row, "old"], file_names[row, "new"], new_file_names)
}
# rename all the files
map2(file_names_SB,
new_file_names,
file.rename)
#thelatemail provided a link with more elegant solutions for generating new file names.

Related

R: locating files that their names contain a specific string from a directory and match to my list of wanted files

It's me the newbie again with another messy file and folder situation(thanks to us biologiests): I got this directory containing a huge amount of .txt files (~900,000+), all the files have been previously handed with inconsistent naming format :(
For example, messy files in directory look like these:
ctrl_S978765_uns_dummy_00_none.txt
ctrl_S978765_3S_Cookie_00_none.txt
S59607_3S_goody_3M_V10.txt
ctrlnuc30-100_S3245678_DMSO_00_none.txt
ctrlRAP_S0846567_3S_Dex_none.txt
S6498432_2S_Fulra_30mM_V100.txt
.....
As you see the naming has no reliable consistency. What's important for me is the ID code embedded in them, such as S978765. Now I have got a list (100 ID codes) of these ID codes that I want.
The CSV file containing the list as below, mind you the list does have repetitive ID codes in the row due to different CLnumber value in the second columns:
ID code CLnumber
S978765 1
S978765 2
S306223 1
S897458 1
S514486 2
....
So I want to achieve below task: find all the messy named files using the code IDs by matching to my list. And copy them into a new directory.
I have thought of use list.files() to get all the .txt files and their names, then I got stuck at the next step at matching the code ID names, I know how to do it with one string, say "S978765", but if I do it one by one, this is almost just like manual digging the folder.
How could I feed the ID code names in column1 as a list and compare/match them with the messy file title names in the directory and then copy them into a new folder?
Many thanks,
ML
This works:
library(stringr)
# get this via list.files in your actual code
files <- c("ctrl_S978765_uns_dummy_00_none.txt",
"ctrl_S978765_3S_Cookie_00_none.txt",
"S59607_3S_goody_3M_V10.txt",
"ctrlnuc30-100_S3245678_DMSO_00_none.txt",
"ctrlRAP_S0846567_3S_Dex_none.txt",
"S6498432_2S_Fulra_30mM_V100.txt")
ids <- data.frame(`ID Code` = c("S978765", "S978765", "S306223", "S897458", "S514486"),
CLnumber = c(1, 2, 1, 1, 2),
stringsAsFactors = FALSE)
str_subset(files, paste(ids$ID.Code, collapse = "|"))
#> [1] "ctrl_S978765_uns_dummy_00_none.txt" "ctrl_S978765_3S_Cookie_00_none.txt"
str_subset takes a character vector and returns elements matching some pattern. In this case, the pattern is "S978765|S978765|S306223|S897458|S514486" (created by using paste), which is a regular expression that matches any of the ID codes separated by |. So we take files and keep only the elements that have a match in ID Code.
There are many other ways to do this, which may or may not be more clear. For example, you could pass ids$ID.Code directly to str_subset instead of constructing a regular expression via paste, but that would throw a warning about object lengths every time, which could get confusing (or cause problems if you get used to ignoring it and then ignore it in a different context where it matters). Another method would be to use purrr and keep, but while that might be a little bit more clear to write, it would be a lot more inefficient since it would mean making multiple passes over the files vector -- not relevant in this context, but possibly very relevant if you suddenly need to do this for hundreds of thousands of files and IDs.
You could use regex to extract the ID codes from the file name.
Here, I have used the pattern "S" followed by 5 or more numbers. Once we extract the ID_codes, we can compare them with the ones which we have in csv.
Assuming the csv is called df and the column name is ID_Codes we can use %in% to filter them.
We can then use file.copy to move files from one folder to another folder.
all_files <- list.files(path = '/Path/To/Folder', full.names = TRUE)
selected_files <- all_files[sub('.*(S\\d{5,}).*', '\\1', basename(all_files))
%in% unique(df$ID_Codes)]
file.copy(selected_files, 'new_path/for/files')

How to rename part of a file

I would to to rename part of a file name, because the structure is hardcoded in getfiles.
I have metabolomics mzML files containing ltQCs, sQCs and samples, but the name of the files have different lenghts (6,6,7).I am trying to run XCMS, but it only picks up ltQCs and sQCs, because the structure is hardcoded to 6. How do I change the structure of the filename see example below:
2020-02-02_B1W1_RP_NEG_P7_A20_001.mzML (structure of 7)
to
2020-02-02_B1W1_RP_NEG_P7A20_001.mzML (structure of 6)
I have higlighted the part that I would like to change. If this is impossible, maybe renaming the ltQCs and sQCs may be easier by adding a letter or number, so I get a structure of 7 and then change the structure in getfiles to 7.
Hope somebody can help, thank you:)
Best
You can change the file names with a regular expression using gsub which removes the penultimate underline
my_regex <- "(_)([[:alnum:]]{3}_[[:alnum:]]{3}[.]mzML)"
my_filename <- "2020-02-02_B1W1_RP_NEG_P7_A20_001.mzML"
gsub(my_regex, "\\2", my_filename)
#> [1] "2020-02-02_B1W1_RP_NEG_P7A20_001.mzML"
So you could do something like
rename_mzMLs <- function(directory)
{
filenames <- list.files(directory, pattern = ".mzML")
my_regex <- "(_)([[:alnum:]]{3}_[[:alnum:]]{3}[.]mzML)"
new_filenames <- gsub(my_regex, "\\2", filenames)
file.rename(filenames, new_filenames)
}
And run it by doing
rename_mzMLs("C:/path/to/mzML/files/")
Obviously, I can't test this since I don't have any mzML files, so ensure you back up your files before running this function!

How to group strings according to first half in R?

I have an R code that looks like this:
files <- list.files(get_directory())
files <- files[grepl("*.dat$", files)]
files
where get_directory() is a function I wrote that returns the current directory. So, I'm getting all the files with extension .dat in the directory that I want. But, my files are named as follows:
2^5-3^3-18-simul.dat
2^5-3^3-18-uniform.dat
2^7-3^4-33-simul.dat
2^7-3^4-33-uniform.dat
...
So, now I want to creates groups of 2 according to the first part, so I want 2^5-3^3-18-simul.dat and 2^5-3^3-18-uniform.dat to be one group, the other two files one group etc. While at a later stage, I need to loop through all the groups, and use the two files that are in the same group. Since, the filenames returned are already sorted, I do not think I need some fancy pattern matching here, I just need to get to group the elements of the string vector two by two as mentioned.
We can use sub to create a grouping variable to split the 'files'
split(files, sub("-[a-z].*", "", files))
#$`2^5-3^3-18`
#[1] "2^5-3^3-18-simul.dat" "2^5-3^3-18-uniform.dat"
#$`2^7-3^4-33`
#[1] "2^7-3^4-33-simul.dat" "2^7-3^4-33-uniform.dat"
data
files <- c("2^5-3^3-18-simul.dat", "2^5-3^3-18-uniform.dat",
"2^7-3^4-33-simul.dat", "2^7-3^4-33-uniform.dat")

Adding new column and its value based on file name

I have 10 data files in my current directory such as data-01, data-02, data-03, data-04 till data-10.
Each of these data files has a few hundred rows with 4 fields. I would like to add a new column name "ID" and keep its ID like 01 (for data file data-01) for all the rows in that file.
A base R solution using a loop would go like this:
df<- c()
for (x in list.files(pattern="*.csv")) {
u<-read.table(x)
u$Label = factor(x)
df <- rbind(df, u)
cat(x, "\n ")
}
This depends on your data files having the same number of columns (though you get get around that inside the loop by selecting which columns you need before rbind) and then you can set whichever filetype you are looking at. The cat is useful because you can better trace read problems (because there are always problems). I bet there is a better way to do this with apply as well.

R: Add paste() elements to file

I'm using base::paste in a for loop:
for (k in 1:length(summary$pro))
{
if (k == 1)
mp <- summary$pro[k]
else
mp <- paste(mp, summary$pro[k], sep = ",")
}
mp comes out as one big string, where the elements are separated by commas.
For example mp is "1,2,3,4,5,6"
Then, I want to put mp in a file, where each of its elements is added to a separate column in the same row. My code for this is:
write.table(mp, file = recompdatafile, sep = ",")
However, mp just appears in the CSV as one big string as opposed to being divided up. How can I achieve my desired format?
FYI
I've also tried converting mp to a list, and strsplit()-ing it, neither of which have worked.
Once I've added summary$pro to the file, how can I also add summary$me (which has the same format), in one row with multiple columns?
Thanks,
n.i.
If you want to write something to a file, write.table() isn't the only way. If you want to avoid headers and quotes and such, you can use the more direct cat. For example
cat(summary$pro, sep=",", file="filename.txt")
will write out the vector of values from summary$pro separated by commas more directly. You don't need to build a string first. (And building a string one element at a time as you did above is a bad practice anyway. Most functions in R can operate on an entire vector at a time, including paste).

Resources