R - find and replace within a script, iteratively [duplicate]

This question already has an answer here:
R: list files based on pattern
(1 answer)
Closed 1 year ago.
I have a somewhat complex script that is working well. It imports multiple .csvs, combines them, adjusts them, re-sorts them and writes them out as multiple new .csvs. All good.
The problem is that I need to run this script on each of 2100 files. Each .csv file has a name incorporating a seven- or eight-digit numeric string alongside other specific identifiers. There are numerous files sharing the same string suffix, and the script works on all of them at once. An example of the naming system:
gfdlesm2g_45Fall_17100202.csv
ccsm4_45Fall_10270102.csv
bnuesm_45Fall_5130205.csv
mirocesmchem_45Fall_5010007.csv
The script begins with:
fnames <- dir("~/Desktop/modified_files/", pattern = "*_45Fall_1030001.csv")
I need to replace "1030001", in this case, with the next number. Right now I am using Find and Replace in RStudio to change the seven- (or eight-) digit number each time the script completes. There has to be a better way than doing this manually for 2100 files.
All the examples I've found are for iterating within a data frame, over its columns or rows, and I can't work out how to adapt that to what I need.
I am thinking that I could make a vector of all the numbers (really they're names), like "01080204", "01090003", "01100001", "18020116", "18020125", "15080303", "16020301", "03170006", "04010101", "04010201", etc.
There must be a way to say, in code, "now pick the next name, and run the script". I looked at the lapply, mapply, sapply family and couldn't figure it out.

If you are looking for files whose names contain _45Fall_, you can use list.files() with a regular expression:
fnames <- list.files("~/Desktop/modified_files/", pattern = "_45Fall_\\d+\\.csv$")
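If you need to run the whole pipeline once per suffix rather than on all matching files at once, one option is to wrap the existing script in a function that takes the suffix as an argument and loop over a vector of codes. A minimal sketch, assuming the body of the current script can be parameterised this way (process_one() is a hypothetical stand-in for that body):

# hypothetical wrapper around the existing script; the suffix code is its only argument
process_one <- function(code) {
  fnames <- dir("~/Desktop/modified_files/",
                pattern = paste0("_45Fall_", code, "\\.csv$"))
  # ... import, combine, adjust, re-sort and write out the new .csv files ...
}

# the codes can be typed out, or recovered from the file names themselves
all_files <- list.files("~/Desktop/modified_files/", pattern = "_45Fall_\\d+\\.csv$")
codes <- unique(sub(".*_45Fall_(\\d+)\\.csv$", "\\1", all_files))

for (code in codes) {
  process_one(code)
}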

Related

R: locating files in a directory whose names contain a specific string and matching them to my list of wanted files

It's me, the newbie, again with another messy file and folder situation (thanks to us biologists): I have a directory containing a huge number of .txt files (~900,000+), all of which were previously named in an inconsistent format :(
For example, messy files in directory look like these:
ctrl_S978765_uns_dummy_00_none.txt
ctrl_S978765_3S_Cookie_00_none.txt
S59607_3S_goody_3M_V10.txt
ctrlnuc30-100_S3245678_DMSO_00_none.txt
ctrlRAP_S0846567_3S_Dex_none.txt
S6498432_2S_Fulra_30mM_V100.txt
.....
As you can see, the naming has no reliable consistency. What matters to me is the ID code embedded in each name, such as S978765. I now have a list of 100 ID codes that I want.
The CSV file containing the list looks like the one below; note that it does contain repeated ID codes, because of the different CLnumber values in the second column:
ID code CLnumber
S978765 1
S978765 2
S306223 1
S897458 1
S514486 2
....
So I want to achieve the following: find all of the messily named files by matching their ID codes against my list, and copy them into a new directory.
I thought of using list.files() to get all the .txt files and their names, but I got stuck at the next step, matching the ID codes. I know how to do it for one string, say "S978765", but doing it one by one is almost the same as digging through the folder manually.
How could I feed the ID code names in column1 as a list and compare/match them with the messy file title names in the directory and then copy them into a new folder?
Many thanks,
ML
This works:
library(stringr)
# get this via list.files in your actual code
files <- c("ctrl_S978765_uns_dummy_00_none.txt",
"ctrl_S978765_3S_Cookie_00_none.txt",
"S59607_3S_goody_3M_V10.txt",
"ctrlnuc30-100_S3245678_DMSO_00_none.txt",
"ctrlRAP_S0846567_3S_Dex_none.txt",
"S6498432_2S_Fulra_30mM_V100.txt")
ids <- data.frame(`ID Code` = c("S978765", "S978765", "S306223", "S897458", "S514486"),
CLnumber = c(1, 2, 1, 1, 2),
stringsAsFactors = FALSE)
str_subset(files, paste(ids$ID.Code, collapse = "|"))
#> [1] "ctrl_S978765_uns_dummy_00_none.txt" "ctrl_S978765_3S_Cookie_00_none.txt"
str_subset takes a character vector and returns elements matching some pattern. In this case, the pattern is "S978765|S978765|S306223|S897458|S514486" (created by using paste), which is a regular expression that matches any of the ID codes separated by |. So we take files and keep only the elements that have a match in ID Code.
There are many other ways to do this, which may or may not be more clear. For example, you could pass ids$ID.Code directly to str_subset instead of constructing a regular expression via paste, but that would throw a warning about object lengths every time, which could get confusing (or cause problems if you get used to ignoring it and then ignore it in a different context where it matters). Another method would be to use purrr and keep, but while that might be a little bit more clear to write, it would be a lot more inefficient since it would mean making multiple passes over the files vector -- not relevant in this context, but possibly very relevant if you suddenly need to do this for hundreds of thousands of files and IDs.
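To finish the task in the question (copying the matches into a new directory), the result of str_subset() can be fed straight to file.copy(). A minimal sketch, with the source and destination paths as placeholders:

matched <- str_subset(list.files("/Path/To/Folder", full.names = TRUE),
                      paste(ids$ID.Code, collapse = "|"))
dir.create("/Path/To/NewFolder", showWarnings = FALSE)  # placeholder destination
file.copy(matched, "/Path/To/NewFolder")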
You could use regex to extract the ID codes from the file names.
Here, I have used the pattern "S" followed by five or more digits. Once we extract the ID codes, we can compare them with the ones we have in the CSV.
Assuming the CSV is read into df and the column is named ID_Codes, we can use %in% to filter the files.
We can then use file.copy to copy the files from one folder to another.
all_files <- list.files(path = '/Path/To/Folder', full.names = TRUE)
selected_files <- all_files[sub('.*(S\\d{5,}).*', '\\1', basename(all_files)) %in%
                              unique(df$ID_Codes)]
file.copy(selected_files, 'new_path/for/files')
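One practical note: file.copy() does not create the destination folder, and it returns a logical vector saying which copies succeeded, so both are worth checking. A small sketch using the same assumed path:

dir.create('new_path/for/files', recursive = TRUE, showWarnings = FALSE)
copied <- file.copy(selected_files, 'new_path/for/files')
sum(!copied)  # number of files that failed to copy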

Using R to read all files in a specific format and with specific extension

I want to read all the files in .xlsx format whose names start with the string "csmom". I have used the list.files() function, but I do not know how to combine the two patterns. Please see the code: I want all the files that start with "csmom" and that have the .xlsx extension.
master1<-list.files(path = "C:/Users/Admin/Documents/csmomentum funal",pattern="^csmom")
master2<-list.files(path = "C:/Users/Admin/Documents/csmomentum funal",pattern="\\.xlsx$")
@jay.sf's solution works for creating a single regular expression that pulls out exactly the condition you want.
However, generally speaking, if you want to cross two lists to find the subset of elements contained in both (in your case, the files that satisfy both conditions), you can use intersect().
intersect(master1, master2)
Will show you all the files that satisfy pattern 1 and pattern 2.
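For completeness, a minimal sketch of both routes, using the same folder as in the question (the single-pattern route is the kind of regular expression the @jay.sf answer refers to):

path <- "C:/Users/Admin/Documents/csmomentum funal"

# single pattern: names starting with "csmom" and ending in ".xlsx"
master <- list.files(path = path, pattern = "^csmom.*\\.xlsx$")

# or: two separate patterns, then keep the files present in both results
master1 <- list.files(path = path, pattern = "^csmom")
master2 <- list.files(path = path, pattern = "\\.xlsx$")
intersect(master1, master2)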

How to remove characters from filenames in R?

Due to an unrelated software error I have many files with double filenames e.g.
c("info10_file1.info10_file1.xy", "info11_file1.info11_file1.xy")
I want to remove this repetition. The files should be:
c("info10_file1.xy", "info11_file1.xy")
I have tried using sapply with a file-renaming function, but that requires a fixed pattern, which means only the files containing info10 will be changed.
So running this code:
sapply(files_list, FUN = function(eachPath) {
  file.rename(from = eachPath,
              to = sub(pattern = 'info10_file1.', replacement = '', eachPath))
})
Will result in:
"info10_file1.xy", "info11_file1.info11_file1.xy"
An improvement is to set pattern='file1.info', which means all the files will be processed, but the 10 or 11 in info10 or info11 will still be repeated, producing this:
"info10_10_file1.xy", "info11_11_file1.xy"
Is there a way to simply delete an arbitrary number of characters? In this example it would be 13.
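One way to avoid counting characters at all is to match the duplicated leading part with a backreference, so the same sub() call works whatever the prefix is. A minimal sketch, assuming files_list holds the bare file names as in the question (this swaps the fixed pattern for a backreference regex rather than deleting a fixed number of characters):

# "info10_file1.info10_file1.xy" -> "info10_file1.xy"
sapply(files_list, FUN = function(eachPath) {
  file.rename(from = eachPath,
              to = sub("^([^.]+)\\.\\1\\.", "\\1.", eachPath))
})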

How to group strings according to first half in R?

I have some R code that looks like this:
files <- list.files(get_directory())
files <- files[grepl("*.dat$", files)]
files
where get_directory() is a function I wrote that returns the current directory. So, I'm getting all the files with extension .dat in the directory that I want. But, my files are named as follows:
2^5-3^3-18-simul.dat
2^5-3^3-18-uniform.dat
2^7-3^4-33-simul.dat
2^7-3^4-33-uniform.dat
...
So now I want to create groups of two according to the first part: I want 2^5-3^3-18-simul.dat and 2^5-3^3-18-uniform.dat to be one group, the other two files another group, and so on. At a later stage I need to loop through all the groups and use the two files that are in the same group. Since the filenames returned are already sorted, I do not think I need any fancy pattern matching here; I just need to group the elements of the string vector two by two, as described.
We can use sub to create a grouping variable to split the 'files'
split(files, sub("-[a-z].*", "", files))
#$`2^5-3^3-18`
#[1] "2^5-3^3-18-simul.dat" "2^5-3^3-18-uniform.dat"
#$`2^7-3^4-33`
#[1] "2^7-3^4-33-simul.dat" "2^7-3^4-33-uniform.dat"
data
files <- c("2^5-3^3-18-simul.dat", "2^5-3^3-18-uniform.dat",
"2^7-3^4-33-simul.dat", "2^7-3^4-33-uniform.dat")

R: Extract value and lines after key word (text file mining)

Setting:
I have (simple) .csv and .dat files created by laboratory devices and other programs, storing information on measurements or calculations. I have found solutions for other languages but not for R.
Problem:
Using R, I am trying to extract values to quickly display results without opening the created files. Here I have two typical settings:
a) I need to read a priori unknown values after known key words
b) I need to read lines after known key words or lines
I can't make functions such as scan() and grep() work.
c) Finally, I would like to loop over dozens of files in a folder and produce a summary (to make the picture complete: I will manage this part myself).
I would appreciate any form of help.
OK, it works for the key value (although it is perhaps not very elegant):
variable <- scan("file.csv", what = character(), sep = "")
returns a character vector of everything.
values <- variable[grep("keyword", variable) + 2] # + 2 because the actual value is stored two places after the keyword
returns the sought values as characters.
as.numeric(gsub(",", ".", values))
For completeness: the data had to be converted to numbers, and the "," versus "." decimal-separator problem needed to be solved.
In one line:
data <- as.numeric(gsub(",", ".", variable[grep("Ks_Boden", variable) + 2]))
Perseverance is not too bad an asset ;-)
The rest isn't finished yet; I will post it once it is.
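For parts (b) and (c), which the answer above leaves open, here is a minimal sketch using readLines() and grep(); the keyword "Ks_Boden" is taken from the answer, and the folder path is a placeholder:

# (b) read the line(s) following a known keyword line
lines <- readLines("file.csv")
hit <- grep("Ks_Boden", lines)
lines[hit + 1]  # the line directly after each keyword line

# (c) loop over all such files in a folder and collect a summary
files <- list.files("~/measurements", pattern = "\\.(csv|dat)$", full.names = TRUE)  # placeholder folder
summary_list <- lapply(files, function(f) {
  lines <- readLines(f)
  as.numeric(gsub(",", ".", lines[grep("Ks_Boden", lines) + 1]))
})
names(summary_list) <- basename(files)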
