Reading and naming multiple .txt files in R

I want to read and name multiple .txt files in R. To be more concrete (sample): I have two subfolders, each with three .txt files that share the same names. Subfolder 'test' has 'alpha.txt', 'bita.txt', 'gamma.txt', and subfolder 'train' has 'alpha.txt', 'bita.txt', 'gamma.txt'. I am using the following code:
files <- dir(recursive=TRUE,pattern ='\\.txt$')
List <- lapply(files,read.table,fill=TRUE)
which gives a List with 6 elements, each one a data frame. I know that the first element is 'alpha' from the test folder, the second is 'bita' from the test folder, and so on. But as the number of files grows, I would like to read the data so that I end up with the variables 'test_alpha', 'test_bita', 'test_gamma', 'train_alpha', 'train_bita', 'train_gamma' in the environment. Is there a way to do it?

I created two folders, /test and /train, in my working directory. We create two data frames and write one to each folder.
# build two small example data frames
df1 <- data.frame(matrix(rnorm(9), 3, 3))
df2 <- data.frame(matrix(runif(12), 4, 3))
# write() flattens the values five per line, which is why the tables
# read back below are ragged and need fill = TRUE
write(df1, './test/alpha.txt')
write(df2, './train/alpha.txt')
We run your code:
files <- dir(recursive=TRUE,pattern ='\\.txt$')
List <- lapply(files,read.table,fill=TRUE)
files
[1] "test/alpha.txt" "train/alpha.txt"
This isolates the files we need. Next we replace the forward slash with an underscore and remove the file extension.
newnames <- gsub('/', '_', files)
newnames1 <- gsub('\\.txt', '', newnames)
newnames1
[1] "test_alpha" "train_alpha"
This vector can now be assigned as the names of List, labelling each data frame.
names(List) <- newnames1
List
$test_alpha
V1 V2 V3 V4 V5
1 -0.6594299 -0.01881557 0.7076588 -0.7096888 0.3629274
2 -1.4401000 1.59659000 -1.9041430 2.3079960 NA
$train_alpha
V1 V2 V3 V4 V5
1 0.9307107 0.6257928 0.6903179 0.5143920 0.6798936
2 0.3652738 0.9297527 0.1902556 0.7243708 0.4541548
3 0.5565041 0.5276907 NA NA NA
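If you really want separate objects 'test_alpha', 'train_alpha', and so on in the global environment, as asked, a minimal sketch using base R's list2env() applied to the named List from above:

# push each named list element into the global environment
# as its own variable, e.g. test_alpha, train_alpha, ...
list2env(List, envir = .GlobalEnv)

That said, keeping the data frames in a named list is usually easier to work with than many loose variables.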


R: how to select files in a folder based on a specific column title

Sorry for the generic question. I'm looking for pointers for sorting out a data folder in which I have numerous .txt files. They all have different names, and for the vast majority the files have the same dimensions, i.e. the same number of columns. The pain is that some of the files, despite having the same number of columns, have different column names: in those files, other variables were measured.
I want to weed these files out, and I cannot do it by simply comparing column counts. Is there a method where I can pass a column name and check how many files in the directory have that column, so that I can move them into a different folder?
UPDATE:
I have created a dummy folder with files that reflect the problem; please see the link below to access them on my Google Drive. In this folder I have put 4 files that have the problem columns.
https://drive.google.com/drive/folders/1IDq7BwfQNkGb9y3RvwlLE3FeMQc38taD?usp=sharing
The problem is that the code seems able to find files matching the selection criteria, i.e. the actual names of the problem columns, but I cannot extract the real index of those files in the list. Any pointers?
library(data.table)
#read in the example file that have the problem column content
df_var <- read.delim("ctrl_S3127064__3S_DMSO_00_none.TXT", header = T, sep = "\t")
#read in a file that I want to use as reference
df_standard <- read.delim("ctrl__S162465_20190111_T8__3S_2DG_3mM_none.TXT", header = T, sep = "\t")
#get the names of columns of each file
standard.names <- names(df_standard)
var.names <- names(df_var)
same.titles <- var.names %in% standard.names
dff.titles <- !var.names %in% standard.names
#confirm that the only problem columns are columns 129, 130 and 131
mismatched.names <- colnames(df_var[129:131])
#visual check the names of the problematic columns
mismatched.names
# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)
# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
  l_files[[i]] <- read.delim(file = files_in_wd[i],
                             sep = "\t",
                             header = T,
                             nrows = 2)
}
# get column names of all files
column_names <- lapply(l_files, names)
# get the unique problem column names
unique_names <- unique(mismatched.names)
unique_names[1]
# decide which files to remove
#here "to_keep" returns an integer vector that I don't understand
#I thought the numbers should represent the ID/index of the elements,
#but I have fewer than 10 files while the numbers in to_keep are around 1000
#this is probably because it's matching against the index of the unlisted list;
#but if I use to_keep <- which(column_names %in% unique_names[1]) it returns an empty vector
to_keep <- which(unlist(column_names)%in% unique_names[1])
#now if I slice the file list with to_keep, files_to_keep returns NA NA NA
files_to_keep <- files_in_wd[to_keep]
#once I have the list of targeted files, I can move them into a new folder using file.move
library(filesstrings)
file.move(files_to_keep, "C:/Users/mli/Desktop/weeding/need to reanalysis" )
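As a side note on the indexing problem described in the comments above: which() on the unlisted vector returns positions among all column names pooled together, not among files. A minimal sketch that indexes files directly (assuming column_names, mismatched.names and files_in_wd from the code above):

# TRUE for each file whose header contains any of the problem columns
has_problem <- vapply(column_names,
                      function(nm) any(mismatched.names %in% nm),
                      logical(1))
files_to_move <- files_in_wd[has_problem]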
If you can distinguish the files you'd like to keep from those you'd like to drop by their column names, you could use something along these lines:
# set working directory to folder with generic text files
setwd("C:/Users/tester/Desktop/generic-text-files")
# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)
# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
  l_files[[i]] <- read.delim(file = files_in_wd[i],
                             sep = ';',
                             header = T,
                             nrows = 2)
}
# get column names of all files
column_names <- lapply(l_files, names)
# get the unique sets of column names
unique_names <- unique(column_names)
# decide which files to keep
to_keep <- which(column_names %in% unique_names[1])
files_to_keep <- files_in_wd[to_keep]
If you have many files, you should probably avoid the loop or read in only the header of each file.
edit after your comment:
by adding nrows = 2, the code reads only the first 2 rows plus the header.
I assume that the first file in the folder has the structure you'd like to keep; that's why column_names is checked against unique_names[1].
files_to_keep contains the names of the files you'd like to keep.
You could try running that on a subset of your data, see if it works, and worry about efficiency later. A vectorized approach might work better, I think.
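For instance, a minimal sketch of that vectorized approach (assuming tab-separated files and that the first file's header is the reference):

# read just the header (plus one row) of each file, then compare
headers <- lapply(files_in_wd, function(f)
  names(read.delim(f, sep = "\t", nrows = 1, check.names = FALSE)))
keep <- vapply(headers, identical, logical(1), y = headers[[1]])
files_to_move <- files_in_wd[!keep]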
edit:
This code works with your dummy-data.
library(filesstrings)
# set working directory to folder with generic text files
setwd("C:/Users/tester/Desktop/generic-text-files/dummyset")
# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)
# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
  l_files[[i]] <- read.delim(file = files_in_wd[i],
                             sep = "\t",
                             header = T,
                             nrows = 2,
                             encoding = "UTF-8",
                             check.names = FALSE)
}
# get column names of all files
column_names <- lapply(l_files, names)
# decide which files to keep
to_keep <- column_names[[1]] # e.g. column names of file #1 are ok
# check if the other files have the same header:
df_filehelper <- data.frame('fileindex' = seq_along(files_in_wd),
                            'filename' = files_in_wd,
                            'keep' = NA)
for(i in 2:length(files_in_wd)){
  df_filehelper$keep[i] <- identical(to_keep, column_names[[i]])
}
df_filehelper$keep[1] <- TRUE # keep the original file used for selecting the right columns
# move files out of the current folder:
files_to_move <- df_filehelper$filename[!df_filehelper$keep] # selects files that are not to be kept
file.move(files_to_move, "C:/Users/tester/Desktop/generic-text-files/dummyset/testsubfolder/")
Due to the large number and size of files it might be worth looking at alternatives to R, e.g. in bash:
for f in ctrl*.txt
do
  if [[ "$(head -1 ctrl__S162465_20190111_T8__3S_2DG_3mM_none.txt | md5)" != "$(head -1 "$f" | md5)" ]]
  then echo "$f"
  fi
done
This loop compares the header line of the 'good' file to the header line of every file (via an md5 hash) and prints the names of files that do not match.

Renaming files and directories with the same pattern in R

I am trying to rename a number of files and folders with a new name.
Example old name: corrected_original_wh_ah108090.pdf
Example new name: corrected_original_gsmp01358_108090.pdf
Example old path: Data/Test2/ARGOS/wh_ah108090/crawl/corrected_original_wh_ah108090.pdf
Example new path: Data/Test2/ARGOS/gsmp01358_108090/crawl/corrected_original_gsmp01358_108090.pdf
Example metadata:
old new
wh_ah108090 gsmp01358_108090
wh_ah108091 gsmp01359_108091
wh_ah108092 gsmp01360_108092
wh_ah108093 gsmp01361_108093
wh_ah108096 gsmp01362_108096
wh_ah108102 gsmp01363_108102
wh_ah108106 gsmp01364_108106
Code:
# Read metadata for ID's #
meta <- read.csv('Metadata.csv')
# list all file paths
pathLs <- list.files('Data/Test2/', recursive = TRUE, full.names = TRUE)
# select only files with old format on the list (for full dataset where some files already have new name)
tbl <- pathLs[!grepl("gsmp", pathLs)]
# select only files with old format on metadata
metadata <- meta[!meta$old == "",]
# function to change old names for new
fileList <- apply(metadata, 1, function(x) {
  fnam <- as.character(unlist(x['old']))
  newnam <- gsub(fnam, as.character(unlist(x['new'])), tbl[grepl(fnam, tbl)])
  return(newnam)
})
# Create dataframe with old and new names
to <- as.character(unlist(fileList))
from <- tbl
# Use rename
file.rename(from, to)
For some reason this file.rename doesn't work.
Is this because I cannot rename files and directories in a path at the same time?
No loops required.
metadata <- read.table(header=T, stringsAsFactors=F, text="
old new
wh_ah108090 gsmp01358_108090
wh_ah108091 gsmp01359_108091
wh_ah108092 gsmp01360_108092
wh_ah108093 gsmp01361_108093
wh_ah108096 gsmp01362_108096
wh_ah108102 gsmp01363_108102
wh_ah108106 gsmp01364_108106")
metadata$new2 <- sprintf("gsmp%05d_%s",
1357L + seq_len(nrow(metadata)), # 1357 can be anything?
gsub("\\D", "", metadata$old))
metadata
# old new new2
# 1 wh_ah108090 gsmp01358_108090 gsmp01358_108090
# 2 wh_ah108091 gsmp01359_108091 gsmp01359_108091
# 3 wh_ah108092 gsmp01360_108092 gsmp01360_108092
# 4 wh_ah108093 gsmp01361_108093 gsmp01361_108093
# 5 wh_ah108096 gsmp01362_108096 gsmp01362_108096
# 6 wh_ah108102 gsmp01363_108102 gsmp01363_108102
# 7 wh_ah108106 gsmp01364_108106 gsmp01364_108106
file.rename(metadata$old, metadata$new2) # should do it
list.files does not list directory names by default, so your code renames only the files, not the directories. So, theoretically, your code should work. Which part of it, specifically, is not working?
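To address the files-versus-directories point directly, one possible sketch (my own, not from either answer): include directories in the listing and rename basenames only, working from the deepest paths up, so renaming a parent directory never invalidates a child path that still needs renaming:

all_paths <- list.files("Data/Test2/", recursive = TRUE,
                        full.names = TRUE, include.dirs = TRUE)
# deepest paths first, so children are renamed before their parents
all_paths <- all_paths[order(lengths(strsplit(all_paths, "/")),
                             decreasing = TRUE)]
for (p in all_paths) {
  base <- basename(p)
  hit  <- match(TRUE, vapply(metadata$old, grepl, logical(1),
                             x = base, fixed = TRUE))
  if (!is.na(hit)) {
    file.rename(p, file.path(dirname(p),
                             gsub(metadata$old[hit], metadata$new[hit],
                                  base, fixed = TRUE)))
  }
}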

Moving folders to different folders

I have a directory (Windows machine) containing many folders. I’d like to split these folders into batches of three, and move them into separate sub-directories.
ID <- c("a", "b", "c", "d", "e", "f")
group <- c("gp1", "gp1","gp1","gp2","gp2","gp2")
samples <- as.data.frame(cbind(ID,group))
ID group
1 a gp1
2 b gp1
3 c gp1
4 d gp2
5 e gp2
6 f gp2
So my working directory contains the folders a-f, and I want to move folders a-c into a subdirectory called gp1, and d-f into a subdirectory called gp2. (I actually have over a hundred of these folders, this is just a small example, and each folder contains multiple large files.)
This is what I have so far:
# find number of samples
nSamps <- nrow(samples)
# calculate how many groups are required
nGrps <- ceiling(nrow(samples)/3)
# list of batch files we want to create
Batchlist <- 1:nGrps
# create folders with appropriate batch number
for (i in Batchlist){
  dir.create(paste("batch", i, sep = ""))
}
# assign a group name to each sample
fileList <- rep(Batchlist, each = 3, len = nrow(samples))
# assign each sample a folder name
samples$group <- paste("batch",fileList, sep = "")
This is where I get stuck. I tried writing a loop that moves each folder to the appropriate sub-directory, but it moves all folders rather than batches (so I get copies of folders a-f in both "batch1" and "batch2"):
for (j in samples$group){
  for (i in samples$ID){
    setwd(paste("file/path", "/", j, sep = ""))
    dir.create(file.path(i))
    setwd("../")
    file.rename(from = i, to = paste(j, "/", i, sep = ""))
  }
}
I've tried a few other things (like writing a small function and using sapply) but the loop is the closest I'm getting.
Can anyone point me in the right direction of where I'm going wrong?
This worked for me. Instead of your last loop, use:
for (i in 1:nrow(samples)){
  idx <- which(dir() == samples$group[i])  # locate the batch folder for this sample
  dir.create(paste0(dir()[idx], "/", samples$ID[i]))
}
The function dir() lists everything in the current directory. Match each row to its batch folder and create the sample's directory inside it.
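A related sketch (my assumption, not part of the answer above): since file.rename() can move a whole directory in one call, each sample folder can be moved into its existing batch folder directly, without creating empty sample directories first:

# move each sample folder into its batch folder;
# file.rename() handles directories as well as files
for (i in seq_len(nrow(samples))) {
  file.rename(from = as.character(samples$ID[i]),
              to   = file.path(samples$group[i], samples$ID[i]))
}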

Need to run R code on all text files in a folder

I have a text file, and I wrote R code to extract a certain line of information from it.
###Read file and format
txt_files <- list.files(pattern = '*.txt')
text <- lapply(txt_files, readLines)
text <- sapply(text, function(x) iconv(x, "latin1", "ASCII", sub=""))
###Search and store grep
l <- grep("words", text)
(k <- length(l))
###Matrix to store data created
mat <- matrix(data = NA, nrow = k, ncol = 2)
nrow(mat)
###Main
for(i in 1:k){
  u = 1
  while(text[(l[i]) - u] != ""){
    line.num = u
    u = u + 1
  }
  mat[i,2] <- text[(l[i]) - u - 1]
  mat[i,1] <- i
}
###Write the output file
write.csv(mat, file = "Evaluation.csv")
It runs on one file at a time. I need to run it on many files and append all the results into a single file, with an additional column that tells me the name of the file from which each result came. I am unable to come up with a solution. What changes do I make?
Applying your operations to all files in a folder:
txt_files <- list.files(pattern = '*.txt')
# apply all your operations to every file with a for loop; wherever you
# used txt_files inside the loop, index it as txt_files[i]
for (i in 1:length(txt_files)) {
  # Operation 1
  # Operation 2
  # Operation 3
  write.table(mat, file = paste0("./", sub(".txt", "", txt_files[i]), ".csv"),
              row.names = F, quote = F, sep = ",")
}
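To also get everything appended into a single file with a filename column, a hedged sketch (extract_lines() is a hypothetical helper wrapping the grep/while logic from the question):

txt_files <- list.files(pattern = "\\.txt$")
results <- lapply(txt_files, function(f) {
  text <- readLines(f)
  text <- iconv(text, "latin1", "ASCII", sub = "")
  mat  <- extract_lines(text)  # hypothetical wrapper around the loop above
  data.frame(file = f, mat)    # tag each result with its source file
})
write.csv(do.call(rbind, results), "Evaluation.csv", row.names = FALSE)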
Merging files with the same headers: I have two csv files with the same header (columns Data and Value), named File1.csv and File2.csv, inside a Header folder, and I merge them to get one header with all rows. Make sure both files have the same number of columns and the same headers in the same order.
## Read in a list of files, an example below
setwd("./Header") # the CSV files to be merged are in this directory
filenames <- list.files(path = "./", pattern = "*.csv")
fullpath <- file.path("./", filenames)
print(filenames)
print(fullpath)
dataset <- do.call("rbind", lapply(filenames, FUN = function(files){
  read.table(files, sep = ",", header = T)
}))
dataset
# Data Value
# 1 ABC 23
# 2 PQR 33
# 3 MNP 43 # Till here was File1.csv
# 4 AC 24
# 5 PQ 34
# 6 MN 44 # Till here was File2.csv
write.table(dataset,file="dataset.csv",sep=",",quote=F,row.names=F,col.names=T)

Merging a bunch of csv files into one with headers

I have a couple of csv files I want to combine as a list then output as one merged csv. Suppose these files are called file1.csv, file2.csv, file3.csv, etc...
file1.csv # example of what each might look like
V1 V2 V3 V4
12 12 13 15
14 12 56 23
How would I create a list of these csvs so that I can output a merged csv that would have headers as the file names and the column names at the top as comments? So a csv that would look something like this in Excel:
# 1: V1
# 2: V2
# 3: V3
# 4: V4
file1.csv
12 12 13 15
14 12 56 23
file2.csv
12 12 13 15
14 12 56 23
file3.csv
12 12 13 15
14 12 56 23
I am trying to use the list function inside a double for loop to merge these csvs, write each list to a variable, and write each variable to a table output. However, this does not work as intended.
# finding the correct files in the directory
files <- dir("test files/shortened")
files_filter <- files[grepl("*\\.csv", files)]
levels <- unique(gsub( "-.*$", "", files_filter))
# merging
for(i in 1:length(levels)){
  level_specific <- files_filter[grepl(levels[i], files_filter)]
  bindme
  for(j in 1:length(level_specific)){
    bindme2 <- read.csv(paste("test files/shortened/", level_specific[j], sep = ""))
    bindme <- list(bindme, bindme2)
    assign(levels[i], bindme)
  }
  write.table(levels[i], file = paste(levels[i], "-output.csv", sep = ""), sep = ",")
}
Looking at your code, I think you don't need a for-loop. With the data.table package you could do it as follows:
library(data.table)
filenames <- list.files(pattern = "*.csv")
files <- lapply(filenames, fread) # fread is the fast reading function from the data.table package
merged_data <- rbindlist(files)
write.csv(merged_data, file = "merged_data_file.csv", row.names = FALSE)
If at least one of the csvs has column names set, they will be used in the resulting data.table.
Considering your code, it could be improved considerably. This:
files <- dir("test files/shortened")
files_filter <- files[grepl("*\\.csv", files)]
can be replaced by just:
filenames <- list.files(pattern="*.csv")
In your for-loop, the first time you call bindme it isn't doing anything. What is it? A list? A data frame? You could use something like:
bindme <- data.table() # or data.frame()
Furthermore, the part:
write.table(levels[i],file = paste(levels[i],"-output.csv",sep=""),sep=",")
will generate several csv-files, but you wanted just one merged file.
Would this help?
library(plyr) # for ldply()

mergeMultipleFiles <- function(dirPath, nameRegex, outputFilename){
  filenames <- list.files(path = dirPath, pattern = nameRegex,
                          full.names = TRUE, recursive = TRUE)
  dataList <- lapply(filenames, read.csv, header = TRUE, check.names = FALSE)
  combinedData <- ldply(dataList, rbind)
  write.csv(combinedData, outputFilename)
}
PS: there is a regex parameter for the filenames, in case you want to merge only files matching a certain pattern.
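For example, a call might look like this (the paths and pattern are placeholders):

mergeMultipleFiles(dirPath = ".", nameRegex = "\\.csv$",
                   outputFilename = "merged-output.csv")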
Modify this example; if I understood your question correctly, it will help you.
# get the names of the csv files in your current directory
file_names = list.files(pattern = "[.]csv$")
# for every name you found go and read the csv with that name
# (this creates a list of files)
import_files = lapply(file_names, read.csv)
# append those files one after the other (collapse list elements to one dataset) and save it as d
d <- do.call(rbind, import_files)
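None of the snippets above produce the exact layout shown in the question, with column names as leading comment lines and each file's name as a section header. A sketch of one way to do that in base R (the file pattern and output path are assumptions):

file_names <- list.files(pattern = "\\.csv$")
out  <- "merged-output.csv"
cols <- names(read.csv(file_names[1]))
# column names as leading comment lines, as in the desired output
writeLines(sprintf("# %d: %s", seq_along(cols), cols), out)
for (f in file_names) {
  cat(f, "\n", file = out, append = TRUE, sep = "")  # file name as section header
  write.table(read.csv(f), out, append = TRUE, sep = " ",
              row.names = FALSE, col.names = FALSE)
}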
