Read specific csv files in multiple directories with R

I need to read specific csv files stored in multiple directories with R. Each directory contains these files (among others), listed under different names but with distinct characters that make them recognisable.
Let's suppose the csv files I want to read contain the following distinct characters: '1' (file 1) and '2' (file 2).
Here's the code I tried so far:
# This is the main directory where all the sub-dirs with files are stored
common_path = "~/my/main/directory"
# Extract the names of the sub-dirs
primary_dirs = list.files(common_path)
# Create empty list of lists
data_lst = rep(list(list()), length(primary_dirs)) # one list per directory
# These are the 2 files (by code) that I need to read
names_csv = c('1', '2')
#### Nested for loop reading the csv files into the list of lists
for (i in 1:length(primary_dirs)) {
  for (j in 1:length(names_csv)) {
    data_lst[[i]][[j]] = read.csv(paste('~/my/main/directory/', primary_dirs[i],
                                        '/name_file', names_csv[j], '.csv', sep = ''))
  }
}
### End of nested loop
The issue here is that the code works only if the names of the files are identical within each directory. But this is not the case. Each directory has different file names but the file names contain the distinct characters '1' and '2'.
E.g. in this case my files in all directories are called 'name_file1.csv' and 'name_file2.csv'. But in my real case the names of files are something like: dir 1 -> 'name_bla_1.csv', 'name_bla_2.csv'; dir 2 -> 'name_gya_1.csv' 'name_gya_2.csv'; etc...
How can I read these 2 files from all my directories with files having different names?
Thanks

You're making things much too complicated. list.files can search recursively (within directories), can return the full file path (so you don't have to worry about pasteing paths together), and can match regex patterns.
files_to_read = list.files(
  path = common_path,        # directory to search within
  pattern = ".*(1|2).*csv$", # regex pattern, some explanation below
  recursive = TRUE,          # search subdirectories
  full.names = TRUE          # return the full path
)
data_lst = lapply(files_to_read, read.csv) # read all the matching files
To learn more about regex, I'd recommend regex101.com. In short: .* matches any sequence of characters, (1|2) matches 1 or 2, and $ matches the end of the string, so ".*(1|2).*csv$" will match all strings that contain a 1 or 2 and end in csv.
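If you also want to keep track of which data frame came from which file, you can label the list elements with the files' base names. This is a small addition to the snippet above, not part of the original answer:

```r
# Read the matched files and name each list element after its source file,
# so data_lst[["name_bla_1.csv"]] etc. works afterwards
data_lst <- setNames(lapply(files_to_read, read.csv),
                     basename(files_to_read))
```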

If you simply want to read in any matching filenames from any subdirectories, you could try this:
regular_expression <- "name_[A-Za-z]+_"
names_csv <- c('1', '2')
names_to_read <- paste0(regular_expression, names_csv, "\\.csv", collapse = "|")
fileList <- list.files(pattern = names_to_read, path = common_path,
                       recursive = TRUE, full.names = TRUE)
data_lst <- lapply(fileList, read.csv)
The output should be a list, where each entry is one of your csv files.
It wasn't clear to me if you wanted to maintain separation based on the directory each file was read from, so I haven't included that.
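For completeness, keeping that per-directory separation is a one-liner once you have the full paths: split the vector by each file's parent directory and read each group. This is a sketch, assuming the fileList vector from above:

```r
# Group the matched paths by the directory they live in, then read each
# group, giving one sub-list of data frames per directory
files_by_dir <- split(fileList, dirname(fileList))
data_lst <- lapply(files_by_dir, function(fs) lapply(fs, read.csv))
```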

Related

In R is there a way to read files and check the first column of unique IDs against a predetermined list of IDs and return only those files or names?

I have a folder of a few thousand files (Both .csv and .xls) and in each of these files the first column is made up of unique ID numbers. The other fields in these files are different pieces of data that I'll need to extract with respect to that unique ID number. The catch is that I have a list of predetermined ID numbers that I need to pull the data for. Some files may or may not have 1 or more of my predetermined list of IDs in them. How do I check the first column in these files against my predetermined list of IDs and return the filenames of the files that contain 1 or more of my predetermined list of IDs?
The following should work:
library(xlsx)
library(readxl) # for read_xls
my_path="C:/Users/Desktop/my_files"
# Collect the names of the files
list_doc_csv=list.files(path = my_path, pattern = "\\.csv$", all.files = TRUE)
list_doc_xlsx=list.files(path = my_path, pattern = "\\.xlsx$", all.files = TRUE)
list_doc_xls=list.files(path = my_path, pattern = "\\.xls$", all.files = TRUE)
# With the anchored "\\.xls$" pattern, .xlsx files no longer match it; this
# filter is kept as a safeguard
list_doc_xls=list_doc_xls[which(!list_doc_xls%in%list_doc_xlsx)]
# Declare the IDs of interest
ID_interesting=c("id1","id33","id101")
list_interesting_doc=c()
# Loop on CSV files and check the content of the first column
for (doc in list_doc_csv){
  column1=read.csv(file=paste0(my_path,"/",doc))[,1]
  if(sum(column1%in%ID_interesting)>0){
    list_interesting_doc=c(list_interesting_doc,doc)
  }
}
# Loop on .xlsx files
for (doc in list_doc_xlsx){
  column1=read.xlsx(file=paste0(my_path,"/",doc),sheetIndex = 1)[,1]
  if(sum(column1%in%ID_interesting)>0){
    list_interesting_doc=c(list_interesting_doc,doc)
  }
}
# Loop on .xls files
for (doc in list_doc_xls){
  column1=unlist(read_xls(path=paste0(my_path, "/", doc))[,1])
  if(sum(column1%in%ID_interesting)>0){
    list_interesting_doc=c(list_interesting_doc,doc)
  }
}
print(list_interesting_doc)
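With a few thousand files, reading only the first column will speed the CSV loop up considerably. Here is a base-R sketch (the helper name is mine, not from the original answer): the header is read first to learn the column count, then "NULL" entries in colClasses drop every column except the first.

```r
# Read just the header to count columns, then re-read keeping only
# column 1 ("NULL" in colClasses skips a column entirely)
read_first_column <- function(path) {
  hdr <- read.csv(path, nrows = 1)
  keep <- c(NA, rep("NULL", ncol(hdr) - 1))
  read.csv(path, colClasses = keep)[, 1]
}
```

You could then use `column1 <- read_first_column(paste0(my_path, "/", doc))` in the CSV loop above.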

How can I list.files() in subdirectories according to a vector of file names?

I have the following example:
# Vector of names
test <- c("banana", "maca")
# Directories
from.dir <- "C:/Users/Windows 10/Documents/teste"
to.dir <- "C:/Users/Windows 10/Documents/teste2"
# Listing files and copy
files <- list.files(path = from.dir, pattern = test, recursive = T)
for (f in files) file.copy(from = f, to = to.dir)
I have a vector of names that includes two names (banana and maca);
I have a directory named "teste". Inside this directory, I have 2 folders. The first folder has an image named "banana" and the second folder has an image named "maca";
I want to copy these two images to another directory named "teste2";
I'm getting an error in list.files(). It only shows me the first name present in the first folder, which is "banana". It doesn't show me the name "maca", present in the second folder;
This way, I can't use the for() loop to copy the files.
Thanks, I appreciate all the help
I think you need to add an additional loop to iterate through each element in test. list.files is expecting a single string for pattern (e.g. "banana"), but instead you passed a vector:
for (pattern in test){
  files <- list.files(path = from.dir, pattern = pattern,
                      recursive = TRUE, full.names = TRUE)
  for (f in files) file.copy(from = f, to = to.dir)
}
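Alternatively, the vector can be collapsed into a single regex so that one list.files call matches all the names at once. This is a sketch under the same assumptions; note full.names = TRUE so that file.copy receives valid paths:

```r
# "banana|maca" matches either name anywhere in a file name
pattern_all <- paste(test, collapse = "|")
files <- list.files(path = from.dir, pattern = pattern_all,
                    recursive = TRUE, full.names = TRUE)
for (f in files) file.copy(from = f, to = to.dir)
```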

Read second sheet of xlsx file from various subdirectories of a main directory R

I want to read the sheet that contains the word "All" or "all" of an Excel workbook for every subdirectory based on a specific pattern.
I have tried list.files() but it does not work properly.
files_to_read = list.files(
  path = common_path,  # directory to search within
  pattern = "X - GEN", # regex pattern to match
  recursive = TRUE,    # search subdirectories
  full.names = TRUE    # return the full path
)
data_lst = lapply(files_to_read, read.xlsx)
I am assuming your sub-directories have a similar name that is identifiable?
Assumptions, let's say:
your sub-directories start with 'this',
the files saved in the sub-directories start with the file name 'my_file', and
the tab that you are trying to read in contains the word 'all'.
If the tab that you are reading in is located in same position (e.g. 2nd tab of every file) then it is easier as you can specify the sheet within read.xlsx as sheet = 2 but if this is not the case then one way you could do is by creating your own function that allows for this.
Then
library(openxlsx)
# getting the name of subdirectories starting with the word 'this'
my_dir <- list.files(pattern = "^this", full.names = TRUE)
# getting the name of the files starting with 'my_file', e.g. my_file.xlsx, my_file2.xlsx
my_files <- list.files(my_dir, pattern = "^my_file", full.names = TRUE)
my_read_xlsx <- function(files_to_read, sheets_to_read) {
  # file to import
  wb <- loadWorkbook(files_to_read)
  # getting the sheet names that contain 'all' (or any other string you specify)
  # ignore.case is there so that matching the tab names is not case-sensitive
  ws <- names(wb)[grepl(sheets_to_read, names(wb), ignore.case = TRUE)]
  # reading in the excel tab identified above
  xl_data <- read.xlsx(wb, ws)
  return(xl_data)
}
# Using the function created above to import tabs containing 'all'
my_list <- lapply(my_files, FUN = function(x) my_read_xlsx(x, sheets_to_read = "all"))
# Converting the list into a data.frame
my_data <- do.call("rbind", my_list)

renaming existing .csv files within a folder

I have a folder called "data" which contains .csv files with data from individual participants. The filename of each of the participants .csv file is an alphanumeric code assigned to them (which is also stored in a column within the .csv called "ppt"), plus the word "data" and the hour they completed the study (e.g., 13 = 1pm).
So for example, the filename of this participant .csv would be "3ht2phfu7data13.csv"
ppt         choice   error
3ht2phfu7   d        0
3ht2phfu7   d        0
3ht2phfu7   k        1
whilst the filename of this participant .csv would be "3a5tzoirpdata15.csv"
ppt         choice   error
3a5tzoirp   k        1
3a5tzoirp   d        0
3a5tzoirp   k        1
These are just 2 examples, but there are 60 individual .csv files in total.
I am trying to rename each of these files, so that instead of containing the participant alphanumeric code, each participant is assigned a number ranging from 1 to 60. So for example, instead of an individual participant file being named "3ht2phfu7data.csv", I'd like it to be named "1data.csv", and for the ppt column to also change to be "1" for each row (to match the new filename), rather than the "3ht2phfu7" that it currently is.
Then going along with another example, for "3a5tzoirpdata.csv" to be named "2data.csv" and for the ppt column to also change to be "2" for each row (to match the new filename). And then so on with the remaining 58 .csv files in the folder.
I have tried the following code; no error message appears, but it is not producing amended .csv files. Any help would be really appreciated.
files <- list.files(path = 'data/', pattern = '*.csv', full.names = TRUE)
sapply(files, function(file){
  x <- read.csv(file)
  x$participant <- c(1:60)
  write.csv(paste0(x, "data", file))
})
You had the right idea, but there were some problems in the sapply.
You can't iterate over filenames if you want to assign a consecutive number.
In the write.csv the object to write to file was missing. For the file name we first have to extract the file's directory with dirname and then add the desired filename.
files <- list.files(path = 'data/', pattern = '\\.csv$', full.names = TRUE)
sapply(1:length(files), function(i){
  # read file
  x <- read.csv(files[i])
  # change the participant code to a consecutive number
  # (the column is called "ppt" in your files)
  x$ppt <- i
  write.csv(x, paste0(dirname(files[i]), "/", i, "data.csv"),
            row.names = FALSE, quote = FALSE)
})
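One optional addition, sketched here under the filename format described in the question (the key file name is my own choice): keep a lookup of which number replaced which alphanumeric code, so the original IDs aren't lost after renaming.

```r
# Strip the trailing "data<hour>.csv" to recover each participant's code,
# then save a number-to-code key alongside the data
codes <- sub("data\\d*\\.csv$", "", basename(files))
key <- data.frame(number = seq_along(files), ppt = codes)
write.csv(key, "data/participant_key.csv", row.names = FALSE)
```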

How to read .xlsm files in a folder, a folder that is present in different folders using R

I have a directory with a list of folders, each of which contains a folder named "ABC". This "ABC" folder holds '.xlsm' files. I want to use R code to read the '.xlsm' files in each "ABC" folder, which sit under different folders of the directory.
Thank you for your help
If you already know the paths to each file, then simply use read_excel from the readxl package:
library(readxl)
mydata <- read_excel("ABC/myfile.xlsm")
If you first need to get the paths to each file, you can use a system command (I'm on Ubuntu 18.04) to find all of the paths and store them in a vector. You can then import them one at a time:
myshellcommand <- "find /path/to/top/directory -path '*/ABC/*' -type d"
mypaths <- system(command = myshellcommand, intern = TRUE)
Because of your directory requirements, one method for finding all of the files can be a double list.files:
ld <- list.files(pattern="^ABC$", include.dirs=TRUE, recursive=TRUE, full.names=TRUE)
lf <- list.files(ld, pattern="\\.xlsm$", ignore.case=TRUE, recursive=TRUE, full.names=TRUE)
To read them all into a list (good ref for dealing with a list-of-frames: http://stackoverflow.com/a/24376207/3358272):
lstdf <- sapply(lf, read_excel, simplify=FALSE)
This defaults to opening the first sheet in each workbook. Other options in readxl::read_excel that might be useful: sheet=, range=, skip=, n_max=.
Given a list of *.xlsm files in your working directory you can do the following:
list.files(
  path = getwd(),
  pattern = glob2rx(pattern = "*.xlsm"),
  full.names = TRUE,
  recursive = TRUE
) -> files_to_read

lst_dta <- lapply(
  X = files_to_read,
  FUN = function(x) {
    cat("Reading:", x, fill = TRUE)
    openxlsx::read.xlsx(xlsxFile = x)
  }
)
Results
Given two files, one with columns A, B and the other with columns C, D, the generated list corresponds to:
>> lst_dta
[[1]]
  C D
1 3 4

[[2]]
  A B
1 1 2
Notes
This will read all .xlsm files found in the directory tree starting from getwd().
openxlsx is efficient due to the use of Rcpp. If you are going to be handling a substantial amount of MS Excel files this package is worth exploring, IMHO.
Edit
As pointed out by @r2evans in the comments, you may want to read *.xlsm files that reside only within an ABC folder, ignoring *.xlsm files outside it. You could filter your files vector in the following manner:
grep(pattern = "ABC", x = files_to_read, value = TRUE)
It's unlikely, but if you have *.xlsm files whose names contain the string ABC and that are saved outside an ABC folder, you may get extra matches.
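Matching the /ABC/ path component instead of ABC anywhere in the string avoids those false positives. A sketch, assuming the files_to_read vector from above:

```r
# fixed = TRUE: treat "/ABC/" as a literal substring, not a regex,
# so only files inside an ABC directory are kept
only_abc <- grep("/ABC/", files_to_read, value = TRUE, fixed = TRUE)
```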
