How to rename part of a file - r

I would like to rename part of a file name, because the structure is hardcoded in getfiles.
I have metabolomics mzML files containing ltQCs, sQCs and samples, but the file names have different lengths (6, 6, 7). I am trying to run XCMS, but it only picks up ltQCs and sQCs, because the structure is hardcoded to 6. How do I change the structure of the file name? See the example below:
2020-02-02_B1W1_RP_NEG_P7_A20_001.mzML (structure of 7)
to
2020-02-02_B1W1_RP_NEG_P7A20_001.mzML (structure of 6)
I have highlighted the part that I would like to change. If this is impossible, maybe renaming the ltQCs and sQCs would be easier, by adding a letter or number so that I get a structure of 7 and can then change the structure in getfiles to 7.
Hope somebody can help, thank you:)
Best

You can change the file names with a regular expression using gsub, which here removes the penultimate underscore:
my_regex <- "(_)([[:alnum:]]{3}_[[:alnum:]]{3}[.]mzML)"
my_filename <- "2020-02-02_B1W1_RP_NEG_P7_A20_001.mzML"
gsub(my_regex, "\\2", my_filename)
#> [1] "2020-02-02_B1W1_RP_NEG_P7A20_001.mzML"
So you could do something like
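rename_mzMLs <- function(directory)
{
  # full.names = TRUE returns paths that include the directory, so
  # file.rename() finds the files whatever the current working directory is
  filenames <- list.files(directory, pattern = "\\.mzML$", full.names = TRUE)
  my_regex <- "(_)([[:alnum:]]{3}_[[:alnum:]]{3}[.]mzML)"
  new_filenames <- gsub(my_regex, "\\2", filenames)
  file.rename(filenames, new_filenames)
}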
And run it by doing
rename_mzMLs("C:/path/to/mzML/files/")
Obviously, I can't test this since I don't have any mzML files, so ensure you back up your files before running this function!
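Since the function above is untested, a dry run is a cheap safeguard. The sketch below is one possible way to preview the old and new names before anything is renamed. The helper name preview_mzML_renames, the temporary folder and the ltQC01 file name are made up for illustration and do not come from the question.

```r
# Hypothetical helper: list old and new names without renaming anything
preview_mzML_renames <- function(directory) {
  old <- basename(list.files(directory, pattern = "\\.mzML$", full.names = TRUE))
  my_regex <- "(_)([[:alnum:]]{3}_[[:alnum:]]{3}[.]mzML)$"
  new <- sub(my_regex, "\\2", old)
  data.frame(old = old, new = new, changed = old != new,
             stringsAsFactors = FALSE)
}

# Try it on empty throwaway files in a temporary folder (assumed names)
tmp <- file.path(tempdir(), "mzml_preview")
dir.create(tmp, showWarnings = FALSE)
file.create(file.path(tmp, c("2020-02-02_B1W1_RP_NEG_P7_A20_001.mzML",
                             "2020-02-02_B1W1_RP_NEG_ltQC01_001.mzML")))

preview <- preview_mzML_renames(tmp)
print(preview)
```

Rows where changed is FALSE would be left alone by the rename. It is worth checking that none of the QC files show up as changed, since any name ending in two three-character fields matches the pattern.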

Related

Renaming files from RStudio with A DF as Reference

I'm trying to rename files in a working directory folder from RStudio. The files are named with IDs and I want to replace the IDs with names. I have a reference file, a data frame (urba_o) with supplierID, companyname, and vendornumber. I tried this for loop but it doesn't work. Error: the condition has length > 1 and only the first element will be used. Any ideas where I'm going wrong?
original_names <- list.files()
urba_o <- import("C:/Users/MaangiJ/Downloads/urba_o.xlsx")
# for loop
for (x in original_names){
if(x == urba_o$supplierid[]){
file.rename(x,urba_o$CompanyName[])
}
}
file.rename is vectorized, so no for loop is needed. Something like this should work:
## figure out which files are here and need renaming
rows_to_rename = urba_o$supplierid %in% original_names
## rename them
with(urba_o[rows_to_rename, ], file.rename(supplierid, CompanyName))
If you did want a for loop, this would work (though it will be less efficient, as well as longer to write):
for (i in seq_len(nrow(urba_o))) {
if(urba_o$supplierid[i] %in% original_names) {
file.rename(urba_o$supplierid[i], urba_o$CompanyName[i])
}
}
Do note that you'll need to follow the file name rules for your operating system. For example, on Windows file names can't contain the following reserved characters: < > : " / \ | ? *
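If the company names might contain characters that Windows rejects, one option is to clean them before renaming. This is only a sketch: sanitize_filename is a hypothetical helper, and replacing each offending character with an underscore is an assumption about what you want.

```r
# Hypothetical helper: swap characters that Windows does not allow in
# file names ( < > : " / \ | ? * ) for a replacement string
sanitize_filename <- function(x, replacement = "_") {
  gsub("[<>:\"/\\\\|?*]", replacement, x)
}

# Made-up company names, for illustration only
clean <- sanitize_filename(c("Acme: Tools/Parts", "What?*.xlsx"))
print(clean)
```

With the data frame from the question, the cleaned names could be passed on as file.rename(supplierid, sanitize_filename(CompanyName)). Two different names can collapse to the same cleaned name, so checking anyDuplicated() on the result first would be prudent.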

File renaming (string substitution) without a clear pattern using R

Currently, I am working with a long list of files.
They have a name pattern of SB_xxx_(parts). (different extensions), where xxx refers to an item code.
SB_19842.png
SB_19842_head.png
SB_19842_hand.png
SB_19842_head.pdf
...
It is found that many of these codes have incorrect entries.
I got two columns in hand: One is for old codes and one is new codes (let's say A & B). I hope to change all those old codes in the file names to the new code.
old new
12154 24124
92482 02425
.....
My first thought is to use file.rename()
However, it is a one-to-one changing approach. I cannot do this because every item has a different number of parts and different file extensions.
Is there any recursive method that can simply change all incorrect file names with strings in A and replace them with strings in B? Anyone get an idea, please?
A loop solution with purrr::map2 at the end:
library(purrr)
#create files to rename
file.create("SB_19842.png")
file.create("SB_19842_head.png")
file.create("SB_19842_hand.png")
file.create("SB_19842_head.pdf")
file.create("SB_12154.png")
file.create("SB_12154_head.png")
file.create("SB_12154_hand.png")
file.create("SB_12154_head.pdf")
# a dataframe with old and new patterns
file_names <- data.frame(
old = c("19842", "12154"),
new = c("new1", "new2")
)
# old filenames from the directory, specify path if needed
file_names_SB <- list.files(pattern = "SB_")
# function to substitute one type of code with another
sub_one_code <- function(old_code, new_code, file_names_original){
gsub(paste0("SB_", old_code), paste0("SB_", new_code), file_names_original)
}
# loop to substitute all codes
new_file_names <- file_names_SB
for (row in seq_len(nrow(file_names))){
new_file_names <- sub_one_code(file_names[row, "old"], file_names[row, "new"], new_file_names)
}
# rename all the files
map2(file_names_SB,
new_file_names,
file.rename)
@thelatemail provided a link with more elegant solutions for generating new file names.
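One loop-free way to build the new names is to extract the code from each file name and look it up with match(). This is a sketch, not necessarily what the linked solutions do. It assumes every name starts with SB_ followed by an all-digit code, and uses the example codes from the question.

```r
# Lookup table of wrong and corrected codes (example values from the question)
codes <- data.frame(old = c("12154", "92482"),
                    new = c("24124", "02425"),
                    stringsAsFactors = FALSE)

# In real use this would come from list.files(pattern = "^SB_")
files <- c("SB_12154.png", "SB_12154_head.pdf", "SB_19842_hand.png")

# Pull out the item code: the digits right after "SB_"
code <- sub("^SB_([0-9]+).*$", "\\1", files)

# Row of each code in the lookup table; NA means the code is not listed
idx <- match(code, codes$old)
hit <- !is.na(idx)

# Swap the code and keep the rest of the name (part and extension)
new_files <- files
new_files[hit] <- paste0("SB_", codes$new[idx[hit]],
                         sub("^SB_[0-9]+", "", files[hit]))
print(new_files)

# Once the preview looks right:
# file.rename(files[hit], new_files[hit])
```

Because each file is looked up exactly once against the whole code, a new code that happens to equal some other old code is not substituted a second time, which can happen when gsub is applied row by row.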

R: locating files that their names contain a specific string from a directory and match to my list of wanted files

It's me, the newbie, again with another messy file and folder situation (thanks to us biologists): I have a directory containing a huge number of .txt files (~900,000+), and the files were previously named with inconsistent formats :(
For example, messy files in directory look like these:
ctrl_S978765_uns_dummy_00_none.txt
ctrl_S978765_3S_Cookie_00_none.txt
S59607_3S_goody_3M_V10.txt
ctrlnuc30-100_S3245678_DMSO_00_none.txt
ctrlRAP_S0846567_3S_Dex_none.txt
S6498432_2S_Fulra_30mM_V100.txt
.....
As you see the naming has no reliable consistency. What's important for me is the ID code embedded in them, such as S978765. Now I have got a list (100 ID codes) of these ID codes that I want.
The CSV file containing the list as below, mind you the list does have repetitive ID codes in the row due to different CLnumber value in the second columns:
ID code CLnumber
S978765 1
S978765 2
S306223 1
S897458 1
S514486 2
....
So I want to achieve the following task: find all the messily named files by matching the ID codes in my list, and copy them into a new directory.
I have thought of using list.files() to get all the .txt files and their names, but then I got stuck at the next step of matching the ID codes. I know how to do it with one string, say "S978765", but if I do it one by one, this is almost like manually digging through the folder.
How could I feed the ID codes in column 1 as a list, match them against the messy file names in the directory, and then copy the matching files into a new folder?
Many thanks,
ML
This works:
library(stringr)
# get this via list.files in your actual code
files <- c("ctrl_S978765_uns_dummy_00_none.txt",
"ctrl_S978765_3S_Cookie_00_none.txt",
"S59607_3S_goody_3M_V10.txt",
"ctrlnuc30-100_S3245678_DMSO_00_none.txt",
"ctrlRAP_S0846567_3S_Dex_none.txt",
"S6498432_2S_Fulra_30mM_V100.txt")
ids <- data.frame(`ID Code` = c("S978765", "S978765", "S306223", "S897458", "S514486"),
CLnumber = c(1, 2, 1, 1, 2),
stringsAsFactors = FALSE)
str_subset(files, paste(ids$ID.Code, collapse = "|"))
#> [1] "ctrl_S978765_uns_dummy_00_none.txt" "ctrl_S978765_3S_Cookie_00_none.txt"
str_subset takes a character vector and returns elements matching some pattern. In this case, the pattern is "S978765|S978765|S306223|S897458|S514486" (created by using paste), which is a regular expression that matches any of the ID codes separated by |. So we take files and keep only the elements that have a match in ID Code.
There are many other ways to do this, which may or may not be more clear. For example, you could pass ids$ID.Code directly to str_subset instead of constructing a regular expression via paste, but that would throw a warning about object lengths every time, which could get confusing (or cause problems if you get used to ignoring it and then ignore it in a different context where it matters). Another method would be to use purrr and keep, but while that might be a little bit more clear to write, it would be a lot more inefficient since it would mean making multiple passes over the files vector -- not relevant in this context, but possibly very relevant if you suddenly need to do this for hundreds of thousands of files and IDs.
You could use regex to extract the ID codes from the file name.
Here, I have used the pattern "S" followed by 5 or more numbers. Once we extract the ID_codes, we can compare them with the ones which we have in csv.
Assuming the csv is called df and the column name is ID_Codes we can use %in% to filter them.
We can then use file.copy to copy the files from one folder to another folder.
all_files <- list.files(path = '/Path/To/Folder', full.names = TRUE)
selected_files <- all_files[sub('.*(S\\d{5,}).*', '\\1', basename(all_files))
%in% unique(df$ID_Codes)]
file.copy(selected_files, 'new_path/for/files')
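Here is the same approach as a self-contained run on throwaway files, as a sketch only. The folder names under tempdir() and the shortened ID list are assumptions for the demo. Note that the destination folder is created first, because file.copy expects an existing directory when several files are copied at once.

```r
src  <- file.path(tempdir(), "messy_src")  # stands in for the real folder
dest <- file.path(tempdir(), "wanted")     # hypothetical destination
dir.create(src, showWarnings = FALSE)
dir.create(dest, showWarnings = FALSE)

# A few empty files named like the examples in the question
file.create(file.path(src, c("ctrl_S978765_uns_dummy_00_none.txt",
                             "S59607_3S_goody_3M_V10.txt",
                             "S6498432_2S_Fulra_30mM_V100.txt")))

# Stand-in for the CSV; repeated IDs are fine
df <- data.frame(ID_Codes = c("S978765", "S978765", "S306223"),
                 stringsAsFactors = FALSE)

all_files <- list.files(src, pattern = "\\.txt$", full.names = TRUE)

# Extract "S" plus five or more digits from each file name
ids <- sub(".*(S\\d{5,}).*", "\\1", basename(all_files))
selected_files <- all_files[ids %in% unique(df$ID_Codes)]

copied <- file.copy(selected_files, dest)
print(list.files(dest))
```

file.copy returns one logical per file, so any FALSE in copied points to a file that was not copied, for example because it already exists in the destination.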

R - how to find exact pattern when listing files

I have a number of files from which I would like to find only the ones that match an exact pattern.
When I run:
mods=c('GISS-E2-H','GISS-E2-R','GISS-E2-R-CC')
files <- list.files(idir, pattern=mods[1])
I got the results:
> files
[1] "clt_Amon_GISS-E2-H-CC_historical_r1i1p1_185001-190012.nc"
[2] "clt_Amon_GISS-E2-H-CC_historical_r1i1p1_190101-195012.nc"
[3] "clt_Amon_GISS-E2-H-CC_historical_r1i1p1_195101-201012.nc"
[4] "clt_Amon_GISS-E2-H_historical_r1i1p1_185001-190012.nc"
[5] "clt_Amon_GISS-E2-H_historical_r1i1p1_190101-195012.nc"
[6] "clt_Amon_GISS-E2-H_historical_r1i1p1_195101-200512.nc"
which is wrong, because I just wanted the last three names (which match the EXACT pattern I wish).
Even if I use regex to create the pattern, I get an empty vector as the result:
files <- list.files(idir, pattern=paste("^",mods[1],"$", sep=''), full.names=T)
> files
character(0)
What am I missing here?
Thanks!
Your code works as intended: pattern is a regular expression that can match anywhere in the file name, and the first three files also contain GISS-E2-H.
To get only the last three, you can do as suggested by @G.Grothendieck and add the _ to mods:
mods=c('GISS-E2-H_','GISS-E2-R','GISS-E2-R-CC')
Now to test your solution I'll create the files:
allfiles <- c("clt_Amon_GISS-E2-H-CC_historical_r1i1p1_185001-190012.nc",
"clt_Amon_GISS-E2-H-CC_historical_r1i1p1_190101-195012.nc",
"clt_Amon_GISS-E2-H-CC_historical_r1i1p1_195101-201012.nc",
"clt_Amon_GISS-E2-H_historical_r1i1p1_185001-190012.nc",
"clt_Amon_GISS-E2-H_historical_r1i1p1_190101-195012.nc",
"clt_Amon_GISS-E2-H_historical_r1i1p1_195101-200512.nc")
for (file in allfiles) {
write("empty file", file)
}
Now it works:
> list.files(getwd(), pattern=mods[1])
[1] "clt_Amon_GISS-E2-H_historical_r1i1p1_185001-190012.nc" "clt_Amon_GISS-E2-H_historical_r1i1p1_190101-195012.nc"
[3] "clt_Amon_GISS-E2-H_historical_r1i1p1_195101-200512.nc"
Edit:
An alternative is as originally proposed, and instead of replacing mods you can append the _ inside list.files:
mods=c('GISS-E2-H','GISS-E2-R','GISS-E2-R-CC') #Original
list.files(getwd(), pattern=paste0(mods[1], "_"))
I would use this with caution, though. If you turn this into some kind of loop to also read the other file patterns in mods, the _ will be appended to all patterns, making them possibly incorrect.
Try this:
files <- list.files(idir, pattern = ".*GISS-E2-H_.*")
Your original vector of patterns was this:
mods=c('GISS-E2-H','GISS-E2-R','GISS-E2-R-CC')
Wrapped in the ^ and $ anchors, each of these only matches a file whose entire name is GISS-E2-H etc. Since no such files exist in your idir, you were getting back character(0).
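The difference between the three kinds of pattern can be checked with grepl on the file names alone, without touching the disk. The last part is a sketch that assumes every model name is surrounded by underscores, as in the files listed in the question.

```r
allfiles <- c("clt_Amon_GISS-E2-H-CC_historical_r1i1p1_185001-190012.nc",
              "clt_Amon_GISS-E2-H_historical_r1i1p1_185001-190012.nc")

# Unanchored: matches anywhere, so the -CC file is picked up too
print(grepl("GISS-E2-H", allfiles))

# Fully anchored: the whole file name would have to be "GISS-E2-H"
print(grepl("^GISS-E2-H$", allfiles))

# Delimited by the underscores around the model name: only the exact model
print(grepl("_GISS-E2-H_", allfiles))

# The same delimiters applied to every model at once
mods <- c("GISS-E2-H", "GISS-E2-R", "GISS-E2-R-CC")
patterns <- paste0("_", mods, "_")
print(patterns)
```

Each element of patterns can then be passed to list.files(idir, pattern = ...) inside a loop.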

R read files with for loop

I just want to use 10 files in R. For each one I want to calculate something.
Exp. file:
stat1_pwg1.out
stat23_pwg2.out
..
stat45_pwg10.out
I try this:
for (i in 1:10){
Data=paste("../XYZ/*_pwg",i,".out",sep="")
line=read.table(Data,head=T)
}
But it does not work. Any hints?
I suspect your problem comes from the wildcard *. A better way to do this might be to first store the file names using dir, then find the ones you want.
files <- dir("../XYZ", pattern="stat[0-9]+_pwg[0-9]+\\.out", full.names=TRUE)
for(f in files) {
  # f is the full path of one file; do the calculation for it here
  line=read.table(f,head=T)
}
You could also use one of the apply family of functions to eliminate the for loop entirely.
A few things about your code.
paste is vectorised, so you can take it out of the loop.
paste("../XYZ/*_pwg", 1:10, ".out", sep = "")
(Though as you'll see in a moment, you don't actually need to use paste at all.)
read.table won't accept wildcards; it needs an exact match on the file name.
Rather than trying to construct a vector of the filenames, you might be better using dir to find the files that exist in your directory, filtered by a suitable naming scheme.
To filter the files, you use a regular expression in the pattern argument. You can convert from wildcards to regular expression using glob2rx.
file_names <- dir("../XYZ", pattern = glob2rx("stat*_pwg*.out"), full.names = TRUE)
data_list <- lapply(file_names, read.table, header = TRUE)
For a slightly more specific fit, where the wildcard only matches digits rather than anything, you need to use regular expressions directly.
file_names <- dir("../XYZ", pattern = "^stat[[:digit:]]+_pwg[[:digit:]]+\\.out$")
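Putting the pieces together, here is a small end-to-end sketch. It writes three made-up files into a temporary folder (standing in for ../XYZ), reads them all with lapply, and names each list element after its file so the results can be told apart. The file contents are invented for the demo.

```r
data_dir <- file.path(tempdir(), "XYZ")  # stands in for "../XYZ"
dir.create(data_dir, showWarnings = FALSE)

# Create stat11_pwg1.out, stat22_pwg2.out and stat33_pwg3.out with dummy data
for (i in 1:3) {
  write.table(data.frame(x = i, y = i * 2),
              file.path(data_dir, paste0("stat", i * 11, "_pwg", i, ".out")),
              row.names = FALSE)
}

# full.names = TRUE so read.table gets a path it can open from anywhere
paths <- dir(data_dir,
             pattern = "^stat[[:digit:]]+_pwg[[:digit:]]+\\.out$",
             full.names = TRUE)

data_list <- lapply(paths, read.table, header = TRUE)
names(data_list) <- basename(paths)

# One calculation per file, here simply the sum of column y
y_sums <- sapply(data_list, function(d) sum(d$y))
print(y_sums)
```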
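# pattern is a regular expression, not a wildcard, so no * is needed
files <- dir(pattern = "Rip1_")
files
# Readfunc stands for your own reading function
for (f in files) { assign(f, Readfunc(f)) }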
