I am trying to rename a number of files and folders according to a lookup table of new names.
Example old name: corrected_original_wh_ah108090.pdf
Example new name: corrected_original_gsmp01358_108090.pdf
Example old path: Data/Test2/ARGOS/wh_ah108090/crawl/corrected_original_wh_ah108090.pdf
Example new path: Data/Test2/ARGOS/gsmp01358_108090/crawl/corrected_original_gsmp01358_108090.pdf
Example metadata:
old new
wh_ah108090 gsmp01358_108090
wh_ah108091 gsmp01359_108091
wh_ah108092 gsmp01360_108092
wh_ah108093 gsmp01361_108093
wh_ah108096 gsmp01362_108096
wh_ah108102 gsmp01363_108102
wh_ah108106 gsmp01364_108106
Code:
# Read metadata for IDs
meta <- read.csv('Metadata.csv')
# list all file paths
pathLs <- list.files('Data/Test2/', recursive = TRUE, full.names = TRUE)
# select only files with old format on the list (for full dataset where some files already have new name)
tbl <- pathLs[!grepl("gsmp", pathLs)]
# select only files with old format on metadata
metadata <- meta[!meta$old == "", ]
# function to change old names for new
fileList <- apply(metadata, 1,
                  function(x) {
                    fnam <- as.character(x['old'])
                    newnam <- gsub(fnam, as.character(x['new']), tbl[grepl(fnam, tbl)])
                    return(newnam)
                  })
# Create dataframe with old and new names
to <- as.character(unlist(fileList))
from <- tbl
# Use rename
file.rename(from, to)
For some reason this file.rename doesn't work.
Is this because I cannot rename files and directories in a path at the same time?
No loops required.
metadata <- read.table(header=T, stringsAsFactors=F, text="
old new
wh_ah108090 gsmp01358_108090
wh_ah108091 gsmp01359_108091
wh_ah108092 gsmp01360_108092
wh_ah108093 gsmp01361_108093
wh_ah108096 gsmp01362_108096
wh_ah108102 gsmp01363_108102
wh_ah108106 gsmp01364_108106")
metadata$new2 <- sprintf("gsmp%05d_%s",
1357L + seq_len(nrow(metadata)), # 1357 can be anything?
gsub("\\D", "", metadata$old))
metadata
# old new new2
# 1 wh_ah108090 gsmp01358_108090 gsmp01358_108090
# 2 wh_ah108091 gsmp01359_108091 gsmp01359_108091
# 3 wh_ah108092 gsmp01360_108092 gsmp01360_108092
# 4 wh_ah108093 gsmp01361_108093 gsmp01361_108093
# 5 wh_ah108096 gsmp01362_108096 gsmp01362_108096
# 6 wh_ah108102 gsmp01363_108102 gsmp01363_108102
# 7 wh_ah108106 gsmp01364_108106 gsmp01364_108106
file.rename(metadata$old, metadata$new2) # should do it
list.files() does not return directory names by default, so your code renames only the files, not the directories. In theory, then, your code should work. Specifically, which part of the code is failing?
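For completeness, here is a hedged sketch of one way to keep the from/to vectors aligned (the apply in the question can return matches in metadata order rather than tbl order) and to derive each new path from its old one in a single pass. The paths and IDs below are invented for illustration:

```r
# Invented lookup table mirroring the question's metadata
meta <- data.frame(old = c("wh_ah108090", "wh_ah108091"),
                   new = c("gsmp01358_108090", "gsmp01359_108091"),
                   stringsAsFactors = FALSE)

from <- c("Data/Test2/ARGOS/wh_ah108090/crawl/corrected_original_wh_ah108090.pdf",
          "Data/Test2/ARGOS/wh_ah108091/crawl/corrected_original_wh_ah108091.pdf")

# Substitute every old ID with its new ID; fixed = TRUE avoids regex surprises
to <- from
for (i in seq_len(nrow(meta))) {
  to <- gsub(meta$old[i], meta$new[i], to, fixed = TRUE)
}

# file.rename(from, to) will still fail if the target directory
# (e.g. .../gsmp01358_108090/crawl/) does not exist yet; rename the
# directories separately, or create them first with dir.create(recursive = TRUE)
```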
Sorry for the generic question. I'm looking for pointers on sorting out a data folder containing numerous .txt files. All of them have different titles, and the vast majority have the same dimensions, i.e. the same number of columns. The pain is that some of the files, despite having the same number of columns, have different column names, because other variables were measured in those files.
I want to weed out these files, and I cannot do so simply by comparing column counts. Is there a method where I can pass a column name and check how many files in the directory have that column, so that I can move them into a different folder?
UPDATE:
I have created a dummy folder with files that reflect the problem;
please see the link below to access the files on my Google Drive. In this folder, I have included 4 files that have the problem columns.
https://drive.google.com/drive/folders/1IDq7BwfQNkGb9y3RvwlLE3FeMQc38taD?usp=sharing
The problem is that the code seems able to find the files matching the selection criteria, i.e. the actual names of the problem columns, but I cannot extract the real indices of those files in the list. Any pointers?
library(data.table)
#read in the example file that have the problem column content
df_var <- read.delim("ctrl_S3127064__3S_DMSO_00_none.TXT", header = T, sep = "\t")
#read in a file that I want to use as reference
df_standard <- read.delim("ctrl__S162465_20190111_T8__3S_2DG_3mM_none.TXT", header = T, sep = "\t")
#get the names of columns of each file
standar.names <- names(df_standard)
var.names <- names(df_var)
same.titles <- var.names %in% standar.names
dff.titles <- !var.names %in% standar.names
#confirm the only problem columns are columns 129, 130 and 131
mismatched.names <- colnames(df_var[129:131])
#visual check the names of the problematic columns
mismatched.names
# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)
# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
l_files[[i]] <- read.delim(file = files_in_wd[i],
sep = "\t",
header = T,
nrows = 2)
}
# get column names of all files
column_names <- lapply(l_files, names)
# get unique problem column names
unique_names <- unique(mismatched.names)
unique_names[1]
# decide which files to remove
#here "to_keep" returns an integer vector that I don't understand
#I thought the numbers should represent the ID/index of the elements
#but I have fewer than 10 files, and the numbers in to_keep are around 1000
#this is probably because it's matching against the index of the unlisted list
#but if I use to_keep <- which(column_names %in% unique_names[1]) it returns an empty vector
to_keep <- which(unlist(column_names)%in% unique_names[1])
#now if I want to slice the file using to_keep the files_to_keep returns NA NA NA
files_to_keep <- files_in_wd[to_keep]
#once I have a list of targeted files, I can move them into a new folder by using file.move
library(filesstrings)
file.move(files_to_keep, "C:/Users/mli/Desktop/weeding/need to reanalysis" )
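For what it's worth, the index mismatch described in the comments comes from unlist() flattening every file's column names into one long vector; testing each file's header separately keeps the indices aligned with files_in_wd. A small sketch with invented column names:

```r
# Invented headers for three hypothetical files
column_names <- list(c("a", "b", "probX"),
                     c("a", "b", "c"),
                     c("a", "probX", "b"))
mismatched.names <- "probX"

# One TRUE/FALSE per file, so which() gives file indices, not flat positions
has_problem <- vapply(column_names,
                      function(nms) any(mismatched.names %in% nms),
                      logical(1))
to_move <- which(has_problem)
# files_to_move <- files_in_wd[to_move]
```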
If you can distinguish the files you'd like to keep from those you'd like to drop depending on the column names, you could use something along these lines:
# set working directory to folder with generic text files
setwd("C:/Users/tester/Desktop/generic-text-files")
# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)
# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
l_files[[i]] <- read.delim(file = files_in_wd[i],
sep = ';',
header = T,
nrows = 2)
}
# get column names of all files
column_names <- lapply(l_files, names)
# get the unique sets of column names
unique_names <- unique(column_names)
# decide which files to keep
to_keep <- which(column_names %in% unique_names[1])
files_to_keep <- files_in_wd[to_keep]
If you have many files you should probably avoid the loop, or read in only the header of each file.
edit after your comment:
by adding nrows = 2 the code only reads the first 2 rows + the header.
I assume that the first file in the folder has the structure that you'd like to keep, that's why column_names is checked against unique_names[1].
files_to_keep contains the names of the files you'd like to keep
you could try to run that on a subset of your data and see if it works and worry about efficiency later. A vectorized approach might work better I think.
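As a sketch of that idea (not tested against the real data), reading only the header plus a single data row of each file keeps the comparison cheap even for large files:

```r
# Two throwaway tab-separated files with matching / mismatching headers
f1 <- tempfile(fileext = ".txt")
f2 <- tempfile(fileext = ".txt")
writeLines(c("a\tb\tc", "1\t2\t3"), f1)
writeLines(c("a\tb\tz", "1\t2\t3"), f2)

read_header <- function(path) {
  # nrows = 1 stops after the header plus one data row
  names(read.delim(path, sep = "\t", header = TRUE, nrows = 1))
}

headers <- lapply(c(f1, f2), read_header)
same_as_first <- vapply(headers, identical, logical(1), y = headers[[1]])
```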
edit:
This code works with your dummy-data.
library(filesstrings)
# set working directory to folder with generic text files
setwd("C:/Users/tester/Desktop/generic-text-files/dummyset")
# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)
# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
l_files[[i]] <- read.delim(file = files_in_wd[i],
sep = "\t",
header = T,
nrows = 2,
encoding = "UTF-8",
check.names = FALSE
)
}
# get column names of all files
column_names <- lapply(l_files, names)
# decide which files to keep
to_keep <- column_names[[1]] # e.g. column names of file #1 are ok
# check if the other files have the same header:
df_filehelper <- data.frame('fileindex' = seq_along(files_in_wd),
'filename' = files_in_wd,
'keep' = NA)
for(i in 2:length(files_in_wd)){
df_filehelper$keep[i] <- identical(to_keep, column_names[[i]])
}
df_filehelper$keep[1] <- TRUE # keep the original file used for selecting the right columns
# move files out of the current folder:
files_to_move <- df_filehelper$filename[!df_filehelper$keep] # selects files that are not to be kept
file.move(files_to_move, "C:/Users/tester/Desktop/generic-text-files/dummyset/testsubfolder/")
Due to the large number and size of files it might be worth looking at alternatives to R, e.g. in bash:
for f in ctrl*.txt
do
if [[ "$(head -1 ctrl__S162465_20190111_T8__3S_2DG_3mM_none.txt | md5)" != "$(head -1 $f | md5)" ]]
then echo "$f"
fi
done
This command compares the column names of the 'good file' to the column names of every file and prints out the names of files that do not match.
Before I dive into the question, here is a similar problem asked but there is not yet a solution.
So, I am working in R, and there is a folder in my working directory called columns that contains 198 similar .csv files, each named with a 6-digit integer (e.g. 100000) that increases inconsistently (the file names are actually the names of the variables).
Now, I would like to full join them, but somehow I have to import all of those files into R and then join them. Naturally, I thought about using a list to contain those files and then a loop to join them. This is the code I tried:
#These are the first 3 columns containing identifiers
matrix_starter <- read_csv("files/matrix_starter.csv")
## import_multiple_csv_files_to_R
# Purpose: Import multiple csv files to the Global Environment in R
# set working directory
setwd("columns")
# list all csv files from the current directory
list.files(pattern=".csv$") # use the pattern argument to define a common pattern for import files with regex. Here: .csv
# create a list from these files
list.filenames <- list.files(pattern=".csv$")
#list.filenames
# create an empty list that will serve as a container to receive the incoming files
list.data <- list()
# create a loop to read in your data
for (i in seq_along(list.filenames)) {
list.data[[i]] <- read.csv(list.filenames[i])
list.data[[i]] <- list.data[[i]] %>%
select(`Occupation.Title`,`X2018.Employment`) %>%
rename(`Occupation title` = `Occupation.Title`) #%>%
#rename(list.filenames[i] = `X2018.Employment`)
}
# add the names of your data to the list
names(list.data) <- list.filenames
# now you can index one of your tables like this
list.data$`113300.csv`
# or this
list.data[1]
# source: https://www.edureka.co/community/1902/how-can-i-import-multiple-csv-files-into-r
The chunk above solves the importing part. Now I have a list of .csv files. Next, I would like to join them:
for (i in 1:length(list.filenames)){
matrix_starter <- matrix_starter %>% full_join(list.data[[i]], by = `Occupation title`)
}
However, this does not work nicely. I end up with around 47,000 rows, whereas I expect only around 1,700. Please let me know your opinion.
Reading the files into R as a list and including the file name as a column can be done like this:
files <- list.files(path = path,
full.names = TRUE,
all.files = FALSE)
files <- files[!file.info(files)$isdir]
data <- lapply(files,
function(x) {
data <- read_xls(
x,
sheet = 1
)
data$File_name <- x
data
})
I am assuming that all your Excel files have the same structure: the same columns and column types.
If that is the case, you can use dplyr::bind_rows to create one combined data frame.
You could of course loop through the list and left_join the list elements, e.g. by using Reduce and merge.
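A minimal sketch of both options, with two invented data frames that share a key column:

```r
list.data <- list(data.frame(key = c("x", "y"), v1 = 1:2),
                  data.frame(key = c("x", "y"), v2 = 3:4))

# Option 1: stack rows (requires identical columns across the list)
# combined <- dplyr::bind_rows(list.data)

# Option 2: merge on the key, one join per list element
joined <- Reduce(function(x, y) merge(x, y, by = "key", all = TRUE), list.data)
```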
Update based on mihndang's comment. Is this what you are after when you say: Is there a way to use the file name to name the column and also not include the columns of file names?
library(dplyr)
library(stringr)
path <- "./files"
files <- list.files(path = path,
full.names = TRUE,
all.files = FALSE)
files <- files[!file.info(files)$isdir]
data <- lapply(files,
function(x) {
read.csv(x, stringsAsFactors = FALSE)
})
col1 <- paste0(str_sub(basename(files[1]), start = 1, end = -5), ": Values")
col2 <- paste0(str_sub(basename(files[1]), start = 1, end = -5), ": Character")
df1 <- data[[1]] %>%
rename(!!col1 := Value,
!!col2 := Character)
I created two simple .csv files in ./files: file1.csv and file2.csv. I read them into a list. I extract the first list element (the DF) and work out column names in a variable. I then rename the columns in the DF by passing the two variables to them. The column name includes the file name.
Result:
> View(df1)
> df1
file1: Values file1: Character
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
7 7 g
8 8 h
9 9 i
10 10 j
I guess you are looking for:
result <- Reduce(function(x, y) merge(x, y, by = "Occupation title", all = TRUE), list.data)
which can be done using purrr's reduce as well:
result <- purrr::reduce(list.data, dplyr::full_join, by = "Occupation title")
When you do a full join, it emits one row for every combination of matching rows. If you are looking for unique records, you might want to use a left join instead: keep the data frame whose rows you want to preserve on the left as the reference, and join the other file on the right.
Hope this helps.
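To illustrate why the row count can balloon: a join emits one row per matching pair, so duplicated keys multiply. A tiny, made-up example:

```r
# Two rows with key "x" on each side -> 2 x 2 = 4 rows after the join
a <- data.frame(key = c("x", "x"), a = 1:2)
b <- data.frame(key = c("x", "x"), b = 3:4)
joined <- merge(a, b, by = "key", all = TRUE)
nrow(joined)
```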
I'm trying to rename the files with .txt extensions in a folder using a corresponding list of names in a column of a table. The table contains two vectors: the first is the heading of the file name in the folders, and the second is the actual name that I wish to use, retaining the original extension. I can use file.rename, but how do I replace each file name with the new name in the corresponding row?
I've tried using a loop with file.rename, except that my code iterates through all the new names in the table for each folder. I'm not sure if there's an R function that will do this.
library(dplyr)
library(stringr)
headers <- read.csv(file="/mnt/data/Development/Sequences/SampleSheet.csv", skip = 18, header = F, nrows =1, as.is = T)
sample.rows = read.csv(file="/mnt/data/Development/Sequences/SampleSheet.csv", skip = 19, header = F)
colnames(sample.rows) = headers
old.new.names <- select(sample.rows, Sample_Name, Description)
startingDir<-"/mnt/data/Development/Sequences"
tcr.sample <- list.files(path=startingDir, pattern="txt", full.names=TRUE )
new.names <- select(old.new.names, Description)
file.rename(list.files(tcr.sample, pattern = ".txt" replacement=new.names)
Files in the folder have generic names: S01_S1.txt, S02_S2.txt, etc. I also have a file containing a table with 2 columns. The first column identifies each file by its first three characters, such as S05, S06, ... S45. The second column has the corresponding new name for the file in that row, such as RK_ci1151_01, RK_ci1151_02, ... RK_ci1151_Baseline. I'm trying to rename the files so that the names are changed to RK_ci1151_01.txt, RK_ci1151_02.txt, and so forth.
I'm also getting an
Error in file.rename(tcr.sample, pattern=".txt", replacement=new.names) : unused arguments (pattern = ".txt", replacement = new.names)
message.
# Script to replace the standard Iseq100 output sample names with name
# in the "Description" column in the SampleSheet.csv file.
library(dplyr)
library(stringr)
# Set the working directory to folder where sample files (.txt) are located.
setwd("/mnt/data/Development/Sequences")
# Extract the headers of the sample table in the file.
headers <- read.csv(file="SampleSheet.csv", skip = 18, header = F, nrows =1, as.is = T)
# Extract the sample rows.
sample.rows = read.csv(file="SampleSheet.csv", skip = 19, header = F)
# Add the headers to the sample rows
colnames(sample.rows) = headers
# Extract the "Description" column, which contains the actual names of the samples
new.names <- select(sample.rows, Description)
new.names <- paste0(new.names$Description, '.txt')
# Extract target .txt files and rename to Description name.
old.names <- list.files(path=".", pattern=".txt")
file.rename(from=old.names, to=new.names)
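One caveat worth checking before the file.rename call: list.files() returns names alphabetically, which may not match the row order of the sample sheet. A hedged sketch (file and sample names invented) that aligns the two explicitly via the three-character prefix:

```r
# Invented file names; note they need not come back in sheet order
old.names <- c("S02_S2.txt", "S01_S1.txt")
lookup <- data.frame(id  = c("S01", "S02"),
                     new = c("RK_ci1151_01", "RK_ci1151_02"),
                     stringsAsFactors = FALSE)

# match() aligns each file with its row in the lookup table
idx <- match(substr(old.names, 1, 3), lookup$id)
new.names <- paste0(lookup$new[idx], ".txt")
# file.rename(from = old.names, to = new.names)
```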
I think you can achieve the file renaming with a different approach. If your CSV file lists the unique file names that you want and they are associated with a unique 'grouping' variable (in your case, 'S01' is associated with files RK_ci1151_01, RK_ci1151_02, RK_ci1151_Baseline), then you can use the new names to recreate the old names. In other words, you can substitute the pattern before '_01.txt', '_02.txt', etc. in the new file names with the pattern of the old file names. Then use the columns of the dataframe as the from= and to= arguments in the file.rename call.
### prep toy data
# create df with old and new names
df <- data.frame(old=paste0(rep(letters[1:3],each=3),
'_', rep(c(0:2),3), '.txt'),
new=paste0(rep(c('foo','bar','hello'),each=3),
'_', rep(c(0:2),3), '.txt'),
stringsAsFactors = F)
# write files with old names
for (i in 1:length(df$old)) {
write.table(NULL,df$old[i])
}
list.files(pattern='\\.txt')
[1] "a_0.txt" "a_1.txt" "a_2.txt" "b_0.txt" "b_1.txt" "b_2.txt" "c_0.txt" "c_1.txt" "c_2.txt"
# edit old names to match user code
df$old <- sub('_[0-9]\\.txt','',df$old)
> df
old new
1 a foo_0.txt
2 a foo_1.txt
3 a foo_2.txt
4 b bar_0.txt
5 b bar_1.txt
6 b bar_2.txt
7 c hello_0.txt
8 c hello_1.txt
9 c hello_2.txt
# separate new file names to join with old
df$join <- sub('.*(_[0-9]\\.txt)','\\1',df$new)
df$old1 <- paste0(df$old,df$join)
# rename
file.rename(df$old1, df$new)
list.files(pattern='\\.txt')
[1] "bar_0.txt" "bar_1.txt" "bar_2.txt" "foo_0.txt" "foo_1.txt" "foo_2.txt"
[7] "hello_0.txt" "hello_1.txt" "hello_2.txt"
I have figured out some part of the code, I will describe below, but I find it hard to iterate (loop) the function over a list of files:
library(Hmisc)
filter_173 <- c("kp|917416", "kp|835898", "kp|829747", "kp|767311")
# This is a vector of values that I want to exclude from the files
setwd("full_path_of_directory_with_desired_files")
filepath <- "//full_path_of_directory_with_desired_files"
list.files(filepath)
predict_files <- list.files(filepath, pattern="predict.txt")
# all files that I want to filter have _predict.txt in them
predict_full <- file.path(filepath, predict_files)
# generates full pathnames of all desired files I want to filter
sample_names <- sapply(strsplit(predict_files, "_"), `[`, 1)
Now here is an example of the simple filtering I want to do with one specific file; this works great. How do I repeat this in a loop over all filenames in predict_full?
test_predict <- read.table("a550673-4308980_A05_RepliG_rep2_predict.txt", header = T, sep = "\t")
# this is a file in my current working directory that I set with setwd above
test_predict_filt <- test_predict[test_predict$target_id %nin% filter_173, ]
write.table(test_predict_filt, file = "test_predict")
Finally, how do I place the filtered files in a folder with the same name as the original, with the suffix "filtered"?
predict_filt <- file.path(filepath, "filtered")
# Place filtered files in the filtered/ subdirectory
filtPreds <- file.path(predict_filt, paste0(sample_names, "_filt_predict.txt"))
I always get stuck at looping! It is hard to share a 100% reproducible example, as everyone's working directories and file paths are unique, though all the code I shared works if you adapt the path names to your machine.
This should work to loop through each of the files and write them out to the new location with the filename specifications you needed. Just be sure to change the directory paths first.
filter_173 <- c("kp|917416", "kp|835898", "kp|829747", "kp|767311") #This is a vector of values that I want to exclude from the files
filepath <- "//full_path_of_directory_with_desired_files"
filteredpath <- "//full_path_of_directory_with_filtered_results/"
# Get vector of predict.txt files
predict_files <- list.files(filepath, pattern="predict.txt")
# Get vector of full paths for predict.txt files
predict_full <- file.path(filepath, predict_files)
# Get vector of sample names
sample_names <- sapply(strsplit(predict_files, "_"), `[`, 1)
# Set for loop to go from 1 to the number of predict.txt files
for (i in seq_along(predict_full)) {
# Load the current file into a dataframe
df.predict <- read.table(predict_full[i], header=T, sep="\t")
# Filter out the unwanted rows
df.predict <- df.predict[!(df.predict$target_id %in% filter_173), ]
# Write the filtered dataframe to the new directory
write.table(df.predict, file = file.path(filteredpath, paste(sample_names[i],"_filt_predict.txt",sep = "")))
}
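One detail worth noting about the filter itself: %in% does literal string comparison, so the "|" characters in the IDs are safe (they would only be special in a regex). A minimal check with invented rows:

```r
filter_173 <- c("kp|917416", "kp|835898")
df <- data.frame(target_id = c("kp|917416", "kp|000001"),
                 val = 1:2, stringsAsFactors = FALSE)

# The trailing comma keeps this a row subset rather than a column subset
kept <- df[!(df$target_id %in% filter_173), ]
```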
I am working in a folder (directory1) and I need to first modify and then use .csv files present in another folder (directory2).
First, I would like to insert values in a column based on the file name, and I would like to do this in a loop for all subjects.
I know how to do it for single files, but not sure how to create the loop.
#Choose directory with .csv files to read
setwd("/Users/R/directory2")
d = read.table("ppt01_EvF.csv", sep=",")
#Change columns names
colnames(d) <- c("Order","Condition","Press","Response","Time","Time2")
#Read file name
filenames <- "ppt01_EvF.csv"
# Remove ".csv"
filenames2 <- sub(".csv", "", filenames)
# Split the string by "_"
filenames_vec <- strsplit(filenames2, split = "_")[[1]]
# Create new column to store the information
d$PPT_N_NUMBER <- filenames_vec[1]
Second, I would like to save all the .csv files as one big file containing all the participants, with just one header row of column names at the top of the new big file.
Last, I would like to save this new big file (.csv) in the folder I am working in (directory1), i.e. a different directory than the one where the single files are stored.
I would appreciate if someone could help me to understand the best way to do this.
It should be something like this:
setwd("/Users/R/directory2")
files <- list.files()
library(data.table)
data_list <- list()
for(i in seq_along(files)){
file_name <- files[i]
d = fread(file_name, sep=",")
#Change columns names
colnames(d) <- c("Order","Condition","Press","Response","Time","Time2")
# Split the string by "_"
filenames_vec <- strsplit(file_name, split = "_")[[1]]
# Create new column to store the information
d$PPT_N_NUMBER <- filenames_vec[1]
data_list[[i]] <- d
}
all_data <- rbindlist(data_list)
fwrite(all_data, '../directory1/all_data.csv')
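For reference, the participant ID can also come from the list names rather than being set inside the loop; data.table users can get the same effect with rbindlist(data_list, idcol = "PPT_N_NUMBER"). A base-R sketch with invented data:

```r
# Invented per-participant tables, named by participant ID
data_list <- list(ppt01 = data.frame(Order = 1, Condition = "A"),
                  ppt02 = data.frame(Order = 2, Condition = "B"))

# Stamp each table with its list name, then stack them
data_list <- Map(function(d, nm) { d$PPT_N_NUMBER <- nm; d },
                 data_list, names(data_list))
all_data <- do.call(rbind, data_list)
```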