I have a folder called "data" which contains .csv files with data from individual participants. The filename of each participant's .csv file is an alphanumeric code assigned to them (which is also stored in a column within the .csv called "ppt"), plus the word "data" and the hour they completed the study (e.g., 13 = 1pm).
So, for example, the filename of this participant's .csv would be "3ht2phfu7data13.csv":
ppt        choice  error
3ht2phfu7  d       0
3ht2phfu7  d       0
3ht2phfu7  k       1
whilst the filename of this participant's .csv would be "3a5tzoirpdata15.csv":
ppt        choice  error
3a5tzoirp  k       1
3a5tzoirp  d       0
3a5tzoirp  k       1
These are just 2 examples, but there are 60 individual .csv files in total.
I am trying to rename each of these files so that, instead of containing the participant's alphanumeric code, each participant is assigned a number ranging from 1 to 60. So, for example, instead of an individual participant file being named "3ht2phfu7data13.csv", I'd like it to be named "1data.csv", and for the ppt column to also change to "1" for each row (to match the new filename), rather than the "3ht2phfu7" it currently is.
Then, following the other example, I'd like "3a5tzoirpdata15.csv" to be named "2data.csv" and the ppt column to also change to "2" for each row (to match the new filename), and so on for the remaining 58 .csv files in the folder.
I have tried the following code; no error message appears, but it is not producing amended .csv files. Any help would be really appreciated.
files <- list.files(path = 'data/', pattern = '*.csv', full.names = TRUE)
sapply(files, function(file){
  x <- read.csv(file)
  x$participant <- c(1:60)
  write.csv(paste0(x, "data", file))
})
You had the right idea, but there were some problems in the sapply.
You can't iterate over the filenames themselves if you want to assign a consecutive number; iterate over an index instead.
In the write.csv call, the object to write to file was missing. For the file name, we first have to extract the file's directory with dirname and then append the desired filename.
files <- list.files(path = 'data/', pattern = '\\.csv$', full.names = TRUE)
sapply(seq_along(files), function(i){
  # read file
  x <- read.csv(files[i])
  # change the participant code to the consecutive number
  x$ppt <- i
  # write the amended file, e.g. "data/1data.csv", next to the original
  write.csv(x, paste0(dirname(files[i]), "/", i, "data.csv"),
            row.names = FALSE, quote = FALSE)
})
Related
I have a folder of a few thousand files (both .csv and .xls), and in each of these files the first column is made up of unique ID numbers. The other fields in these files are different pieces of data that I'll need to extract with respect to that unique ID number. The catch is that I have a list of predetermined ID numbers that I need to pull the data for, and any given file may or may not contain one or more of those IDs. How do I check the first column of these files against my predetermined list of IDs and return the filenames of the files that contain at least one of them?
The following should work:
library(xlsx)    # for read.xlsx
library(readxl)  # for read_xls
my_path = "C:/Users/Desktop/my_files"
# Collect the names of the files; anchoring the patterns with "$" ensures
# that the ".xls" pattern does not also pick up ".xlsx" files
list_doc_csv = list.files(path = my_path, pattern = "\\.csv$", all.files = TRUE)
list_doc_xlsx = list.files(path = my_path, pattern = "\\.xlsx$", all.files = TRUE)
list_doc_xls = list.files(path = my_path, pattern = "\\.xls$", all.files = TRUE)
# Declare the IDs of interest
ID_interesting = c("id1", "id33", "id101")
list_interesting_doc = c()
# Loop over the CSV files and check the content of the first column
for (doc in list_doc_csv){
  column1 = read.csv(file = paste0(my_path, "/", doc))[, 1]
  if (any(column1 %in% ID_interesting)){
    list_interesting_doc = c(list_interesting_doc, doc)
  }
}
# Loop over the .xlsx files
for (doc in list_doc_xlsx){
  column1 = read.xlsx(file = paste0(my_path, "/", doc), sheetIndex = 1)[, 1]
  if (any(column1 %in% ID_interesting)){
    list_interesting_doc = c(list_interesting_doc, doc)
  }
}
# Loop over the .xls files
for (doc in list_doc_xls){
  column1 = unlist(read_xls(path = paste0(my_path, "/", doc))[, 1])
  if (any(column1 %in% ID_interesting)){
    list_interesting_doc = c(list_interesting_doc, doc)
  }
}
print(list_interesting_doc)
Sorry for the generic question. I'm looking for pointers for sorting out a data folder in which I have numerous .txt files. All of them have different titles, and for the vast majority of them the files have the same dimensions, that is, the column numbers are the same. However, the pain is that some of the files, despite having the same number of columns, have different column names; in those files, some other variables were measured.
I want to weed out these files, and I cannot do so by simply comparing column numbers. Is there any method by which I can pass the name of a column and check how many files in the directory have that column, so that I can move them into a different folder?
UPDATE:
I have created a dummy folder with files that reflect the problem;
please see the link below to access the files on my Google Drive. In this folder, I have put 4 files that have the problem columns.
https://drive.google.com/drive/folders/1IDq7BwfQNkGb9y3RvwlLE3FeMQc38taD?usp=sharing
The problem is that the code seems to be able to find files matching the selection criteria, i.e. the actual names of the problem columns, but I cannot extract the real index of such files in the list. Any pointers?
library(data.table)
# read in the example file that has the problem columns
df_var <- read.delim("ctrl_S3127064__3S_DMSO_00_none.TXT", header = T, sep = "\t")
# read in a file that I want to use as reference
df_standard <- read.delim("ctrl__S162465_20190111_T8__3S_2DG_3mM_none.TXT", header = T, sep = "\t")
# get the column names of each file
standard.names <- names(df_standard)
var.names <- names(df_var)
same.titles <- var.names %in% standard.names
dff.titles <- !var.names %in% standard.names
# confirm the only 3 problem columns are columns 129, 130 and 131
mismatched.names <- colnames(df_var[129:131])
# visually check the names of the problematic columns
mismatched.names
# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)
# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
  l_files[[i]] <- read.delim(file = files_in_wd[i],
                             sep = "\t",
                             header = T,
                             nrows = 2)
}
# get column names of all files
column_names <- lapply(l_files, names)
# get the unique problem column names
unique_names <- unique(mismatched.names)
unique_names[1]
# decide which files to remove
# here "to_keep" returns an integer vector that I don't understand;
# I thought the numbers should represent the ID/index of the elements,
# but I have fewer than 10 files while the numbers in to_keep are around 1000.
# this is probably because it's matching the actual index of the unlisted list,
# but if I use to_keep <- which(column_names %in% unique_names[1]) it returns an empty vector
to_keep <- which(unlist(column_names) %in% unique_names[1])
# now if I want to slice the files using to_keep, files_to_keep returns NA NA NA
files_to_keep <- files_in_wd[to_keep]
# once I have a list of targeted files, I can move them into a new folder using file.move
library(filesstrings)
file.move(files_to_keep, "C:/Users/mli/Desktop/weeding/need to reanalysis")
If you can distinguish the files you'd like to keep from those you'd like to drop based on their column names, you could use something along these lines:
# set working directory to folder with generic text files
setwd("C:/Users/tester/Desktop/generic-text-files")
# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)
# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
  l_files[[i]] <- read.delim(file = files_in_wd[i],
                             sep = ';',
                             header = T,
                             nrows = 2)
}
# get column names of all files
column_names <- lapply(l_files, names)
# get unique names of files
unique_names <- unique(column_names)
# decide which files to keep
to_keep <- which(column_names %in% unique_names[1])
files_to_keep <- files_in_wd[to_keep]
If you have many files you should probably avoid the loop, or read in only the header of each file.
edit after your comment:
by adding nrows = 2 the code only reads the first 2 rows + the header.
I assume that the first file in the folder has the structure that you'd like to keep, that's why column_names is checked against unique_names[1].
files_to_keep contains the names of the files you'd like to keep.
you could try to run that on a subset of your data, see if it works, and worry about efficiency later. A vectorized approach might work better I think; see the sketch below.
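For instance, a minimal sketch of that header-only, vectorized idea (header_of is a hypothetical helper; this assumes tab-separated files as in your dummy data):
header_of <- function(f) {
  # read only the header plus one data row and keep the column names
  names(read.delim(f, sep = "\t", header = TRUE, nrows = 1,
                   check.names = FALSE))
}
reference <- header_of(files_in_wd[1])  # file #1 defines the "good" header
keep <- vapply(files_in_wd,
               function(f) identical(header_of(f), reference),
               logical(1))
files_to_keep <- files_in_wd[keep]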
edit:
This code works with your dummy-data.
library(filesstrings)
# set working directory to folder with generic text files
setwd("C:/Users/tester/Desktop/generic-text-files/dummyset")
# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)
# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
  l_files[[i]] <- read.delim(file = files_in_wd[i],
                             sep = "\t",
                             header = T,
                             nrows = 2,
                             encoding = "UTF-8",
                             check.names = FALSE)
}
# get column names of all files
column_names <- lapply(l_files, names)
# decide which files to keep
to_keep <- column_names[[1]] # e.g. column names of file #1 are ok
# check if the other files have the same header:
df_filehelper <- data.frame('fileindex' = seq_along(files_in_wd),
                            'filename' = files_in_wd,
                            'keep' = NA)
for(i in 2:length(files_in_wd)){
  df_filehelper$keep[i] <- identical(to_keep, column_names[[i]])
}
df_filehelper$keep[1] <- TRUE # keep the original file used for selecting the right columns
# move files out of the current folder:
files_to_move <- df_filehelper$filename[!df_filehelper$keep] # selects files that are not to be kept
file.move(files_to_move, "C:/Users/tester/Desktop/generic-text-files/dummyset/testsubfolder/")
Due to the large number and size of files it might be worth looking at alternatives to R, e.g. in bash:
for f in ctrl*.txt
do
  # compare the header line of the known-good file to each file's header
  # (md5 is the macOS command; on Linux use md5sum instead)
  if [[ "$(head -1 ctrl__S162465_20190111_T8__3S_2DG_3mM_none.txt | md5)" != "$(head -1 "$f" | md5)" ]]
  then echo "$f"
  fi
done
This command compares the column names of the 'good file' to the column names of every file and prints out the names of files that do not match.
ep_dir <- "C:/Users/J/Desktop/e_prot_unicode"
Reading and merging the data:
# reading the data: empty list that gets filled up
ep_ldf <- list()
# creates a list of all the files in the directory ending in .txt
listtxt_ep <- list.files(path = ep_dir, pattern = "*.txt", full.names = T)
# loop for reading all the files in the list
for(m in 1:length(listtxt_ep)){
  ep_ldf[[m]] <- read.table(listtxt_ep[m], fill = T, header = F, sep = "\t", stringsAsFactors = FALSE)
}
f_ep = "C:/Users/J/Desktop/e_prot_unicode//05AP.U1"
#reading and merging the files, data.table is then called d_ep
d_ep = data.frame()
for(f_ep in listtxt_ep){
tmp_ep <- read.delim(f_ep,row.names = NULL,sep = "\t",fileEncoding="UTF-16LE",fill = T) %>% as.data.frame(stringsAsFactors = F)
d_ep <- rbind.fill(d_ep, tmp_ep)
}
I want to read in a bunch of .txt files, but the above code reads them in incorrectly. Only the first one (05AP.U1) contains all values properly; all the others are missing the values in the first column (here I do not mean the numbering row), that is, the names. Why does this code only read the first file correctly?
I need to read specific csv files stored in multiple directories with R. Each directory contains these files (and others) which however are listed under different names but with distinct characters that make them recognisable.
Let's suppose the csv files I want to read contains the following distinct character: '1' (file 1) and '2' (file 2).
Here's the code I tried so far:
# This is the main directory where all your the sub-dir with files are stored
common_path = "~/my/main/directory"
# Extract the names of the sub-dir
primary_dirs = list.files(common_path)
# Create empty list of lists
data_lst = rep(list(list()), length(primary_dirs)) # one list per each directory
# These are the 2 files (by code) that I need to read
names_csv = c('1', '2')
#### Nested for loop reading the csv files into the list of lists
for (i in 1:length(primary_dirs)) {
  for (j in 1:length(names_csv)) {
    data_lst[[i]][[j]] = read.csv(paste('~/my/main/directory/', primary_dirs[i],
                                        '/name_file', names_csv[j], '.csv', sep = ''))
  }
}
### End of nested loop
The issue here is that the code works only if the names of the files are identical within each directory. But this is not the case. Each directory has different file names but the file names contain the distinct characters '1' and '2'.
E.g. in this case my files in all directories are called 'name_file1.csv' and 'name_file2.csv'. But in my real case the names of files are something like: dir 1 -> 'name_bla_1.csv', 'name_bla_2.csv'; dir 2 -> 'name_gya_1.csv' 'name_gya_2.csv'; etc...
How can I read these 2 files from all my directories with files having different names?
Thanks
You're making things much too complicated. list.files can search recursively (within directories), can return the full file path (so you don't have to worry about pasting the file path together yourself), and can match regex patterns.
files_to_read = list.files(
  path = common_path,        # directory to search within
  pattern = ".*(1|2).*csv$", # regex pattern, some explanation below
  recursive = TRUE,          # search subdirectories
  full.names = TRUE          # return the full path
)
data_lst = lapply(files_to_read, read.csv) # read all the matching files
To learn more about regex, I'd recommend regex101.com. ".*" matches any sequence of characters, "(1|2)" matches a 1 or a 2, and "$" matches the end of the string, so ".*(1|2).*csv$" will match any name that contains a 1 or a 2 and ends in csv.
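To see it in action, here is a quick check against some made-up file names (hypothetical, mirroring the naming in the question):
test_names <- c("name_bla_1.csv", "name_gya_2.csv", "name_bla_3.csv", "readme.txt")
grepl(".*(1|2).*csv$", test_names)
# [1]  TRUE  TRUE FALSE FALSE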
If you simply want to read in any matching filenames from any subdirectories, you could try this:
regular_expression <- "name_[A-z]+_"
names_csv <- c('1', '2')
names_to_read <- paste0(regular_expression, names_csv, "\\.csv", collapse = "|")
fileList <- list.files(pattern = names_to_read, path = common_path,
recursive = TRUE, full.names = TRUE)
data_lst <- lapply(files_to_read, function(x) read.csv(x))
The output should be a list, where each entry is one of your csv files.
It wasn't clear to me if you wanted to maintain separation based on the directory each file was read from, so I haven't included that; a sketch of one way to do it is below.
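If you did want that separation, a minimal sketch building on the objects above (using the fileList and data_lst from the previous block):
# label each data frame by its file name, then group by source directory
names(data_lst) <- basename(fileList)
data_by_dir <- split(data_lst, dirname(fileList))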
I have a folder full of .txt files that I want to loop through and compress into one data frame, but each .txt file holds the data for one subject, and there are no columns in the text files that indicate the subject number or the time point in the study (e.g. 1-5). I need to add a line or two of code into my loop that looks for strings of four numbers (each file is labeled something like "4325.5_ERN_No_Startle") and creates a column with 4325 and another column with 5, repeated for every data point for that subject until the loop gets to the next one. I have been looking for a while but am still coming up empty; any suggestions?
I also have not quite gotten the loop to work:
path = "/Users/me/Desktop/Event Codes/ERN task/ERN text files transferred"
out.file <- ""
file <- ""
file.names <- dir(path, pattern = ".txt")
for(i in 1:length(file.names)){
  file <- read.table(file.names[i], header = FALSE, fill = TRUE)
  out.file <- rbind(out.file, file)
}
which runs okay until I get this error message part way through:
Error in read.table(file.names[i], header = FALSE, fill = TRUE) :
no lines available in input
Consider using regex to parse the file name for the subject and time point, both of which are then bound to each data frame inside an lapply over the list.files result:
path = "path/to/text/files"
# ANY TXT FILE WITH PATTERN OF 4 DIGITS FOLLOWED BY A PERIOD AND ONE DIGIT
file.names <- list.files(path, pattern="*[0-9]{4}\\.[0-9]{1}.*txt", full.names=TRUE)
# IMPORT ALL FILES INTO A LIST OF DATAFRAMES AND BINDS THE REGEX EXTRACTS
dfList <- lapply(file.names, function(x) {
if (file.exists(x)) {
data.frame(period=regmatches(x, gregexpr('[0-9]{4}', x))[[1]],
subject=regmatches(x, gregexpr('\\.[0-9]{1}', x))[[1]],
read.table(x, header=FALSE, fill=TRUE),
stringsAsFactors = FALSE)
}
})
# COMBINE EACH DATA FRAME INTO ONE
df <- do.call(rbind, dfList)
# REMOVE PERIOD IN SUBJECT (NEEDED EARLIER FOR SPECIAL DIGIT)
df['subject'] <- sapply(df['subject'],
function(x) gsub("\\.", "", x))
You can try to use tryCatch, which basically gives you a NULL instead of an error.
file <- tryCatch(read.table(file.names[i], header = FALSE, fill = TRUE),
                 error = function(e) NULL)
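Put into your loop, that would look something like this (a sketch; note it also starts from an empty data frame rather than an empty string, and simply skips files that failed to read):
out.file <- data.frame()
for(i in 1:length(file.names)){
  # empty or unreadable files come back as NULL instead of stopping the loop
  file <- tryCatch(read.table(file.names[i], header = FALSE, fill = TRUE),
                   error = function(e) NULL)
  if(!is.null(file)){
    out.file <- rbind(out.file, file)
  }
}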