Save multiple files with multiple names with R

I have three files (T1.dnd, T2.dnd, T3.dnd), and in each of them I have to substitute a specific row with another one using the mgsub function. After that, I have to save the files separately in a new folder (Output) under their original names (T1.dnd, T2.dnd, T3.dnd) with a for loop. However, the loop saves only the last file (T3.dnd), so T1.dnd and T2.dnd are missing.
How can I save all three files with their names?
Any help will be greatly appreciated.
library(mgsub)
setwd("C:/Users/feder/Project/test")
treat <- as.list(c("C:/Users/feder/Project/test/T1.dnd",
"C:/Users/feder/Project/test/T2.dnd",
"C:/Users/feder/Project/test/T3.dnd"))
step <- list()
step2 <- list()
step <- lapply(treat, readLines)
step2 <- lapply(step, mgsub, c("______Leaf_fraction__________0.3100"),
c("______Leaf_fraction__________0.1500"))
files <- list.files("C:/Users/feder/Project/test")
names_files <- as.list(files)
savewd <- c("C:/Users/feder/Project/Output")
for (i in length(step2)){
writeLines(step2[[i]], paste(savewd, names_files[[i]], sep = "/"))
}
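A likely cause, sketched under assumptions: `for (i in length(step2))` iterates over the single value `length(step2)` (here 3), so only the last file is written, whereas `seq_along(step2)` visits every index. The sketch below is self-contained: it uses a temporary directory and made-up file contents, and swaps in base R `gsub(fixed = TRUE)` for `mgsub` so that it runs without extra packages. It also takes the output names from `basename()` of the input paths rather than from a separate `list.files()` call, so names stay aligned with contents.

```r
# Hypothetical stand-in data: three small .dnd files in a temporary folder
in_dir  <- file.path(tempdir(), "test")
out_dir <- file.path(tempdir(), "Output")
dir.create(in_dir, showWarnings = FALSE)
dir.create(out_dir, showWarnings = FALSE)

old_row <- "______Leaf_fraction__________0.3100"
new_row <- "______Leaf_fraction__________0.1500"
for (nm in c("T1.dnd", "T2.dnd", "T3.dnd")) {
  writeLines(c("header", old_row, "footer"), file.path(in_dir, nm))
}

# Read every file, replace the row, and keep the results in a list
treat <- list.files(in_dir, pattern = "\\.dnd$", full.names = TRUE)
step  <- lapply(treat, readLines)
step2 <- lapply(step, function(x) gsub(old_row, new_row, x, fixed = TRUE))

# seq_along() yields 1, 2, 3; length() alone would yield only 3
for (i in seq_along(step2)) {
  writeLines(step2[[i]], file.path(out_dir, basename(treat[i])))
}
```

With the real data, `in_dir` and `out_dir` would be the question's test and Output folders, and the `gsub` call could be replaced by the original `mgsub` call.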

Related

Loading files from multiple folders in r and processing

I have multiple CSV files in different folders, i.e. the main folder contains a week1 and a week2 folder. week1 in turn contains file1.csv and week2 contains file2.csv. All files have the same column names. There are hundreds of such files in different directories.
file1 <- data.frame(name = c("Bill","Tom"),Age = c(23,45))
file2 <- data.frame(name = c("Harry","John"),Age = c(34,56))
How can I load them, rbind them in R, and get a final data frame?
I got a clue here: How can I read multiple files from multiple directories into R for processing?
I slightly modified the function to do a row bind as follows, but it is nowhere near what I want:
# Creating a function to process each file
empty_df <- data.frame()
processFile <- function(f) {
df <- read.csv(f)
rbind(empty_df,df)
}
# Find all .csv files
files <- dir("/foo/bar/", recursive=TRUE, full.names=TRUE, pattern="\\.csv$")
# Apply the function to all files.
result <- sapply(files, processFile)
Any help is greatly appreciated!
I'd have tried to do something with a for loop on my side such as
temp = read.csv('week1/file1.csv')
for(i in 2:n){ #n being the number of weeks you have
temp= rbind(temp, read.csv(paste('week',i,'/file',i,'.csv', sep='')))
}
I hope it helped
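A loop-free variant of the same idea, as a minimal sketch: find every CSV recursively, read each one into a list, and bind once at the end. The folder layout and file contents below are made up for the demo, and the sketch assumes all files share the same columns.

```r
# Hypothetical layout: main/week1/file1.csv and main/week2/file2.csv
main <- file.path(tempdir(), "main")
dir.create(file.path(main, "week1"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(main, "week2"), recursive = TRUE, showWarnings = FALSE)
write.csv(data.frame(name = c("Bill", "Tom"), Age = c(23, 45)),
          file.path(main, "week1", "file1.csv"), row.names = FALSE)
write.csv(data.frame(name = c("Harry", "John"), Age = c(34, 56)),
          file.path(main, "week2", "file2.csv"), row.names = FALSE)

# Find all .csv files below main, whatever the subfolder
files <- list.files(main, pattern = "\\.csv$", recursive = TRUE, full.names = TRUE)

# Read each file into a list element, then bind all rows in one call
tables <- lapply(files, read.csv)
result <- do.call(rbind, tables)
```

Binding once at the end avoids growing a data frame inside a loop, which tends to get slow with hundreds of files.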

R: how to find select files in a folder based on matching specific column title

Sorry for the generic question. I'm looking for pointers for sorting out a data folder, in which I have numerous .txt files. All of them have different titles, and for the vast majority of them, the files have the same dimension, that is the column numbers are the same. However, the pain is some of the files, despite having the same number of columns, have different column names. That is in those files, some other variables were measured.
I want to weed out these files, and I cannot do so by simply comparing the number of columns. Is there a way to pass a column name and check which files in the directory have that column, so that I can move them into a different folder?
UPDATE:
I have created a dummy folder with files that reflect the problem;
please see the link below to access the files on my Google Drive. In this folder, I have included 4 files that have the problem columns.
https://drive.google.com/drive/folders/1IDq7BwfQNkGb9y3RvwlLE3FeMQc38taD?usp=sharing
The problem is that the code seems to be able to find files matching the selection criterion, i.e. the actual names of the problem columns, but I cannot extract the real index of such files in the list. Any pointers?
library(data.table)
#read in the example file that have the problem column content
df_var <- read.delim("ctrl_S3127064__3S_DMSO_00_none.TXT", header = T, sep = "\t")
#read in a file that I want to use as reference
df_standard <- read.delim("ctrl__S162465_20190111_T8__3S_2DG_3mM_none.TXT", header = T, sep = "\t")
#get the names of columns of each file
standar.names <- names(df_standard)
var.names <- names(df_var)
same.titles <- var.names %in% standar.names
dff.titles <- !var.names %in% standar.names
#confirm the only 3 columns of problem is column 129,130 and 131
mismatched.names <- colnames(df_var[129:131])
#visual check the names of the problematic columns
mismatched.names
# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)
# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
l_files[[i]] <- read.delim(file = files_in_wd[i],
sep = "\t",
header = T,
nrows = 2)
}
# get column names of all files
column_names <- lapply(l_files, names)
# get unique names of files
unique_names <- unique(mismatched.names)
unique_names[1]
# decide which files to remove
#here "to_keep" returns an integer vector that I don't understand
#I thought the numbers should represent the ID/index of the elements
#but I have fewer than 10 files, and the numbers in to_keep are around 1000
#this is probably because it's matching the actual index within the unlisted list
#but if I use to_keep <- which(column_names %in% unique_names[1]) it returns an empty vector
to_keep <- which(unlist(column_names)%in% unique_names[1])
#now if I want to slice the file using to_keep the files_to_keep returns NA NA NA
files_to_keep <- files_in_wd[to_keep]
#once I have a list of targeted files, I can move them into a new folder by using file.move
library(filesstrings)
file.move(files_to_keep, "C:/Users/mli/Desktop/weeding/need to reanalysis" )
If you can distinguish the files you'd like to keep from those you'd like to drop depending on the column names, you could use something along these lines:
# set working directory to folder with generic text files
setwd("C:/Users/tester/Desktop/generic-text-files")
# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)
# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
l_files[[i]] <- read.delim(file = files_in_wd[i],
sep = ';',
header = T,
nrows = 2)
}
# get column names of all files
column_names <- lapply(l_files, names)
# get unique names of files
unique_names <- unique(column_names)
# decide which files to keep
to_keep <- which(column_names %in% unique_names[1])
files_to_keep <- files_in_wd[to_keep]
If you have many files you should probably avoid the loop or just read in the header of the corresponding file.
Edit after your comment:
By adding nrows = 2, the code reads only the header plus the first 2 rows.
I assume that the first file in the folder has the structure that you'd like to keep; that's why column_names is checked against unique_names[1].
files_to_keep contains the names of the files you'd like to keep.
You could try to run that on a subset of your data, see if it works, and worry about efficiency later. A vectorized approach might work better, I think.
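One way to read only the header and skip the data rows entirely, as a rough sketch: take the first line of each file with `readLines(n = 1)` and compare it with the reference file's first line. The demo files and the choice of the first file as the reference are assumptions for illustration.

```r
# Hypothetical demo folder: two files with matching headers and one odd one
demo <- file.path(tempdir(), "headers")
dir.create(demo, showWarnings = FALSE)
writeLines(c("a\tb\tc", "1\t2\t3"), file.path(demo, "f1.txt"))
writeLines(c("a\tb\tc", "4\t5\t6"), file.path(demo, "f2.txt"))
writeLines(c("a\tb\tx", "7\t8\t9"), file.path(demo, "f3.txt"))

files <- list.files(demo, full.names = TRUE)

# Read just the first line (the header) of every file
headers <- vapply(files, function(f) readLines(f, n = 1), character(1))

# Treat the first file as the reference and flag files whose header matches it
keep <- headers == headers[1]
files_to_keep <- basename(files[keep])
files_to_move <- basename(files[!keep])
```

Comparing raw header lines is stricter than comparing parsed column names (quoting or trailing whitespace would count as a difference), so it is worth checking on a few files first.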
edit:
This code works with your dummy-data.
library(filesstrings)
# set working directory to folder with generic text files
setwd("C:/Users/tester/Desktop/generic-text-files/dummyset")
# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)
# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
l_files[[i]] <- read.delim(file = files_in_wd[i],
sep = "\t",
header = T,
nrows = 2,
encoding = "UTF-8",
check.names = FALSE
)
}
# get column names of all files
column_names <- lapply(l_files, names)
# decide which files to keep
to_keep <- column_names[[1]] # e.g. column names of file #1 are ok
# check if the other files have the same header:
df_filehelper <- data.frame('fileindex' = seq_along(files_in_wd),
'filename' = files_in_wd,
'keep' = NA)
for(i in 2:length(files_in_wd)){
df_filehelper$keep[i] <- identical(to_keep, column_names[[i]])
}
df_filehelper$keep[1] <- TRUE # keep the original file used for selecting the right columns
# move files out of the current folder:
files_to_move <- df_filehelper$filename[!df_filehelper$keep] # selects file that are not to be kept
file.move(files_to_move, "C:/Users/tester/Desktop/generic-text-files/dummyset/testsubfolder/")
Due to the large number and size of files it might be worth looking at alternatives to R, e.g. in bash:
for f in ctrl*.txt
do
if [[ "$(head -n 1 ctrl__S162465_20190111_T8__3S_2DG_3mM_none.txt | md5)" != "$(head -n 1 "$f" | md5)" ]]
then echo "$f"
fi
done
This command compares the column names of the 'good file' to the column names of every file and prints out the names of files that do not match. Note that md5 is the macOS tool; on most Linux systems the equivalent is md5sum.

Loop subset over several files in a directory and output files into a new directory with a suffix

I have figured out part of the code, which I describe below, but I find it hard to iterate (loop) the function over a list of files:
library(Hmisc)
filter_173 <- c("kp|917416", "kp|835898", "kp|829747", "kp|767311")
# This is a vector of values that I want to exclude from the files
setwd("full_path_of_directory_with_desired_files")
filepath <- "//full_path_of_directory_with_desired_files"
list.files(filepath)
predict_files <- list.files(filepath, pattern="predict.txt")
# all files that I want to filter have _predict.txt in them
predict_full <- file.path(filepath, predict_files)
# generates full pathnames of all desired files I want to filter
sample_names <- sapply(strsplit(predict_files , "_"), `[`, 1)
Now here is an example of a simple filtering step I want to do with one specific example file; this works great. How do I repeat this in a loop over all filenames in predict_full?
test_predict <- read.table("a550673-4308980_A05_RepliG_rep2_predict.txt", header = T, sep = "\t")
# this is a file in my current working directory that I set with setwd above
test_predict_filt <- test_predict[test_predict$target_id %nin% filter_173]
write.table(test_predict_filt, file = "test_predict")
Finally how do I place the filtered files in a folder with the same name as original with the suffix filtered?
predict_filt <- file.path(filepath, "filtered")
# Place filtered files in filtered/ subdirectory
filtPreds <- file.path(predict_filt, paste0(sample_names, "_filt_predict.txt"))
I always get stuck at looping! It is hard to share a 100% reproducible example as everyone's working directory and file paths will be unique though all the code I shared works if you adapt it to an appropriate path name on your machine.
This should work to loop through each of the files and write them out to the new location with the filename specifications you needed. Just be sure to change the directory paths first.
filter_173 <- c("kp|917416", "kp|835898", "kp|829747", "kp|767311") #This is a vector of values that I want to exclude from the files
filepath <- "//full_path_of_directory_with_desired_files"
filteredpath <- "//full_path_of_directory_with_filtered_results/"
# Get vector of predict.txt files
predict_files <- list.files(filepath, pattern="predict.txt")
# Get vector of full paths for predict.txt files
predict_full <- file.path(filepath, predict_files)
# Get vector of sample names
sample_names <- sapply(strsplit(predict_files , "_"), `[`, 1)
# Set for loop to go from 1 to the number of predict.txt files
for(i in seq_along(predict_full))
{
# Load the current file into a dataframe
df.predict <- read.table(predict_full[i], header=T, sep="\t")
# Filter out the unwanted rows (note the comma: rows are filtered, all columns are kept)
df.predict <- df.predict[!(df.predict$target_id %in% filter_173), ]
# Write the filtered dataframe to the new directory
write.table(df.predict, file = file.path(filteredpath, paste(sample_names[i],"_filt_predict.txt",sep = "")))
}
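One detail worth checking in row filters like the one above: to drop rows from a data frame, the logical vector goes before a comma (`df[keep, ]`); without the comma, R treats the index as a column selection. A small sketch with a made-up data frame (the `score` column and the `kp|000001` id are invented for the demo):

```r
# Made-up data frame standing in for one predict.txt table
df <- data.frame(target_id = c("kp|917416", "kp|000001", "kp|835898"),
                 score     = c(0.9, 0.5, 0.7))
filter_173 <- c("kp|917416", "kp|835898", "kp|829747", "kp|767311")

# TRUE for rows whose target_id is NOT in the exclusion list
keep <- !(df$target_id %in% filter_173)

# Note the trailing comma: rows are selected, all columns are kept
df_filt <- df[keep, ]
```

Here only the row with the invented id survives, and both columns are retained.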

Loop to insert values in a column based on file name and save in different directory

I am working in a folder (directory1) and I need to first modify and then use .csv files present in another folder (directory2).
First I would like to insert values in a column based on the file name; and I would like to do this in a loop for all subjects.
I know how to do it for single files, but not sure how to create the loop.
#Choose directory with .csv files to read
setwd("/Users/R/directory2")
d = read.table("ppt01_EvF.csv", sep=",")
#Change columns names
colnames(d) <- c("Order","Condition","Press","Response","Time","Time2")
#Read file name
filenames <- "ppt01_EvF.csv"
# Remove ".csv"
filenames2 <- sub("\\.csv$", "", filenames)
# Split the string by "_"
filenames_vec <- strsplit(filenames2, split = "_")[[1]]
# Create new column to store the information
d$PPT_N_NUMBER <- filenames_vec[1]
Second, I would like to save all the .csv files as one big file containing all the participants but just one row at the top of the new big file with the columns names.
Last, I would like to save this new big file (.csv) in the folder I am working on (directory1) - so a different directory than the one the single files are stored.
I would appreciate if someone could help me to understand the best way to do this.
It should be something like this:
setwd("/Users/R/directory2")
files <- list.files()
library(data.table)
data_list <- list()
for(i in 1:length(files)){
file_name <- files[i]
d = fread(file_name, sep=",")
#Change columns names
colnames(d) <- c("Order","Condition","Press","Response","Time","Time2")
# Split the string by "_"
filenames_vec <- strsplit(file_name, split = "_")[[1]]
# Create new column to store the information
d$PPT_N_NUMBER <- filenames_vec[1]
data_list[[i]] <- d
}
all_data <- rbindlist(data_list)
fwrite(all_data, '../directory1/all_data.csv')

How to apply a function to every possible pairwise combination of files stored in a common directory

I have a directory containing a large number of csv files. I would like to load the data into R and apply a function to every possible pair combination of csv files in the directory, then write the output to file.
The function that I would like to apply is matchpt() from the Biobase package, which compares locations between two data frames.
Here is an example of what I would like to do (although I have many more files than this):
Three files in directory: A, B and C
Perform matchpt on each pairwise combination:
nn1 = matchpt(A,B)
nn2 = matchpt(A,C)
nn3 = matchpt(B,C)
Write nn1, nn2 and nn3 to csv file.
I have not been able to find any solutions for this yet and would appreciate any suggestions. I am really not sure where to go from here but I am assuming that some sort of nested for loop is required to somehow cycle sequentially through all pairwise combinations of files. Below is a beginning at something but this only compares the first file with all the others in the directory so does not work!
library("Biobase")
# create two lists of identical filenames stored in the directory:
filenames1 = list.files(path=dir, pattern="csv$", full.names=FALSE, recursive=FALSE)
filenames2 = list.files(path=dir, pattern="csv$", full.names=FALSE, recursive=FALSE)
for(i in 1:length(filenames2)){
# load the first data frame in list 1
df1 <- lapply(filenames1[1], read.csv, header=TRUE, stringsAsFactors=FALSE)
df1 <- data.frame(df1)
# load a second data frame from list 2
df2 <- lapply(filenames2[i], read.csv, header=TRUE, stringsAsFactors=FALSE)
df2 <- data.frame(df2)
# isolate the relevant columns from within the two data frames
dat1 <- as.matrix(df1[, c("lat", "long")])
dat2 <- as.matrix(df2[, c("lat", "long")])
# run the matchpt function on the two data frames
nn <- matchpt(dat1, dat2)
#Extract the unique id code in the two filenames (for naming the output file)
file1 = filenames1[1]
code1 = strsplit(file1,"_")[[1]][1]
file2 = filenames2[i]
code2 = strsplit(file2,"_")[[1]][1]
outname = paste(code1, code2, sep="_")
outfile = paste(outname, "_nn.csv", sep="")
write.csv(nn, file=outfile, row.names=FALSE)
}
Any suggestions on how to solve this problem would be greatly appreciated. Many thanks!
You could do something like:
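out <- combn( list.files(), 2, FUN=matchpt, simplify=FALSE )
write.table( do.call( rbind, out ), file='output.csv', sep=',' )
combn passes each pair to FUN as a single character vector of length 2, so this assumes the function can work from the two file names in that form and that the result has the same structure each time, so that the rbinding makes sense. simplify=FALSE keeps the results as a list for do.call.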
You could also write your own function to pass to combn that takes the 2 file names, runs matchpt and then appends the results to the csv file. Remember that if you pass an open filehandle to write.table then it will append to the file instead of overwriting what is there.
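A sketch of the custom-function idea: `combn()` hands each pair to `FUN` as one length-2 vector, so the wrapper unpacks it, reads both files, and returns a result. `compare_pair` and its row-count comparison are hypothetical stand-ins for the real `matchpt` call, and the demo files are made up.

```r
# Hypothetical demo files A.csv, B.csv, C.csv with lat/long columns
pairs_dir <- file.path(tempdir(), "pairs")
dir.create(pairs_dir, showWarnings = FALSE)
for (nm in c("A", "B", "C")) {
  write.csv(data.frame(lat = 1:3, long = 4:6),
            file.path(pairs_dir, paste0(nm, ".csv")), row.names = FALSE)
}
files <- list.files(pairs_dir, pattern = "\\.csv$", full.names = TRUE)

# Stand-in for the real comparison; swap the body for a matchpt() call
compare_pair <- function(pair) {
  d1 <- as.matrix(read.csv(pair[1])[, c("lat", "long")])
  d2 <- as.matrix(read.csv(pair[2])[, c("lat", "long")])
  data.frame(file1 = basename(pair[1]), file2 = basename(pair[2]),
             rows1 = nrow(d1), rows2 = nrow(d2))
}

# simplify = FALSE keeps one list element per pair: A-B, A-C, B-C
out <- combn(files, 2, FUN = compare_pair, simplify = FALSE)
result <- do.call(rbind, out)

# Written outside pairs_dir so a rerun does not pick it up as input
write.csv(result, file.path(tempdir(), "pairs_output.csv"), row.names = FALSE)
```

Because `combn()` generates each unordered pair once, A & B is compared but B & A is not; that matches the nn1, nn2, nn3 example in the question.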
Try this example:
#dummy filenames
filenames <- paste0("file_",1:5,".txt")
#loop through unique combination
for(i in 1:(length(filenames)-1))
for(j in (i+1):length(filenames))
{
flush.console()
print(paste("i=",i,"j=",j,"|","file1=",filenames[i],"file2=",filenames[j]))
}
In response to my question I seem to have found a solution. The below uses a for loop to perform every pairwise combination of files in a common directory (this seems to work and gives EVERY combination of files i.e. A & B and B & A):
# create a list of filenames
filenames = list.files(path=dir, pattern="csv$", full.names=FALSE, recursive=FALSE)
# For loop to compare the files
for(i in 1:length(filenames)){
# load the first data frame in the list
df1 = lapply(filenames[i], read.csv, header=TRUE, stringsAsFactors=FALSE)
df1 = data.frame(df1)
file1 = filenames[i]
code1 = strsplit(file1,"_")[[1]][1] # extract unique id code of file (in case where the id comes before an underscore)
# isolate the columns of interest within the first data frame
d1 <- as.matrix(df1[, c("lat_UTM", "long_UTM")])
# load the comparison file
for (j in 1:length(filenames)){
# load the second data frame in the list
df2 = lapply(filenames[j], read.csv, header=TRUE, stringsAsFactors=FALSE)
df2 = data.frame(df2)
file2 = filenames[j]
code2 = strsplit(file2,"_")[[1]][1] # extract unique id code of file 2
# isolate the columns of interest within the second data frame
d2 <- as.matrix(df2[, c("lat_UTM", "long_UTM")])
# run the comparison function on the two data frames (in this case matchpt)
out <- matchpt(d1, d2)
# Merge the unique id code in the two filenames (for naming the output file)
outname = paste(code1, code2, sep="_")
outfile = paste(outname, "_out.csv", sep="")
# write the result to file
write.csv(out, file=outfile, row.names=FALSE)
}
}
