Iterate over multiple subdirectories to read.csv a specific file - r

I have a folder with over 100 sub-folders that each contain a specific csv "cats.csv" that I need to read into R.
So far I've got:
parent_folder <- "path of parent files"
sub_folders <- list.dirs(parent_folder, recursive = TRUE)[-1]
cat_files <- dir(sub_folders, recursive = TRUE, full.names = TRUE, pattern = "cats")
I've then tried variations of lapply and map to apply read.csv to all of the cat_files, but it doesn't seem to work.

filelist <- list.files(pattern = "cats.csv", recursive = TRUE, full.names = TRUE)
then
lapply(setNames(nm=filelist), read.csv)
Edit, with thanks to r2evans below.
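If the goal is one combined table rather than a list, the named list can be row-bound afterwards — a minimal sketch, assuming every cats.csv has the same columns (DF_list and all_cats are just illustrative names):
DF_list <- lapply(setNames(nm = filelist), read.csv)
all_cats <- do.call(rbind, DF_list)   # one data frame; dplyr::bind_rows(DF_list, .id = "file") keeps the path as a column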

We get the paths using Sys.glob (check that to be sure it is what you want) and then use Map to get a named list, DFs, of data.frames with the files' contents.
paths <- parent_folder |>
  file.path("*", "cats.csv") |>
  Sys.glob()
DFs <- Map(read.csv, paths)
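If it is more convenient to refer to each data frame by its sub-folder rather than by the full path, one small follow-up (a sketch, not part of the answer above):
names(DFs) <- basename(dirname(paths))   # label each data frame with the name of its sub-folder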

Related

How to rbind similar csv files that are scattered in many different zip files, using a function?

Consider one folder, 'C:/ZFILE', that contains many zip files.
Now, consider that each of these zips contains many csv files, among them one specific csv named 'NAME.CSV'; all these scattered 'NAME.CSV' files are named and structured the same way (i.e., same columns).
How to rbind all these scattered csv?
The script below allows that, but a function would be more appropriate.
How to do this?
Thanks
zfile <- "C:/ZFILE"
zlist <- list.files(path = zfile, pattern = "\\.zip$", recursive = FALSE, full.names = TRUE)
zlist # list all zip files in the zfile folder
zunzip <- lapply(zlist, unzip, exdir = zfile) # unzip all zips into the zfile folder (may take time depending on the number of zips)
library(data.table) # rbindlist & fread
csv_name <- "NAME.CSV"
csv_list <- list.files(path = zfile, pattern = paste0("\\", csv_name, "$"), recursive = TRUE, ignore.case = FALSE, full.names = TRUE)
csv_list # list all 'NAME.CSV' files found under the zfile folder
csv_rbind <- rbindlist(sapply(csv_list, fread, simplify = FALSE), idcol = 'filename')
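For what it's worth, the script above can be wrapped into a function more or less as-is — a rough sketch under the same assumptions (the zips are extracted next to themselves, and data.table is installed); rbind_zipped_csv is just an illustrative name:
rbind_zipped_csv <- function(zfile, csv_name = "NAME.CSV") {
  zlist <- list.files(path = zfile, pattern = "\\.zip$", full.names = TRUE)
  lapply(zlist, unzip, exdir = zfile)                                  # extract every zip into zfile
  csv_list <- list.files(path = zfile, pattern = paste0(csv_name, "$"),
                         recursive = TRUE, full.names = TRUE)
  data.table::rbindlist(sapply(csv_list, data.table::fread, simplify = FALSE),
                        idcol = "filename")
}
csv_rbind <- rbind_zipped_csv("C:/ZFILE")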
You can try this type of function (you can pass the unzip call directly to the cmd param of data.table::fread()):
get_zipped_csv <- function(path) {
  # requires the unzip utility to be available on the system PATH
  fnames <- list.files(path, pattern = "\\.zip$", full.names = TRUE)
  rbindlist(lapply(fnames, \(f) fread(cmd = paste0("unzip -p ", f))[, src := f]))
}
Usage:
get_zipped_csv(path = "C:/ZFILE")

Specifying pathname in map_dfr

The structure of my directory is as follows:
Extant_Data -> Data -> Raw -> course_enrollment
Extant_Data -> Data -> Raw -> frpm
I have a few different functions to read in some text files and Excel files respectively.
read_fun = function(path){
test = read.delim(path, sep="\t", header=TRUE, fill = TRUE, colClasses = c(rep("character",23)))
test
}
read_fun_frpm= function(path){
test = read_excel(path, sheet = 2, col_names = frpm_names)
}
I feed this into map_dfr so that the function reads in each of the files and rowbinds them.
allfiles = list.files(path = "Extant_Data/Data/Raw/course_enrollment",
pattern = "CourseEnrollment.txt",
full.names=FALSE,
recursive = T)
# Rowbind all the course enrollment data
# !!! BUT I HAVE set the working directory to a subdirectory so that it finds those files
setwd("/Extant_Data/Data/Raw/course_enrollment")
course_combined <- map_dfr(allfiles,read_fun)
allfiles = list.files(path = "Extant_Data/Data/Raw/frpm/post12",
pattern = "frpm*",
full.names=FALSE,
recursive = T)
# Rowbind all the frpm data
# !!!I have to change the directory AGAIN
setwd(""Extant_Data/Data/Raw/frpm/post12")
frpm_combined <- map_dfr(allfiles,read_fun_frpm)
As mentioned in the comments, I have to keep changing the working directory so that map_dfr can locate the files. I don't think this is best practice; how might I work around this so I don't have to keep changing the directory? Any suggestions appreciated. Sorry, it's hard to provide a reproducible example.
Note: This throws an error.
frpm_combined <- map_dfr(allfiles,read_fun_frpm('Extant_Data/Data/Raw/frpm/post12'))
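One way to avoid the setwd() calls entirely is to have list.files() return the full (relative) paths, so the reader functions get paths they can resolve from the project root — a sketch reusing the functions defined above:
course_files <- list.files(path = "Extant_Data/Data/Raw/course_enrollment",
                           pattern = "CourseEnrollment.txt",
                           full.names = TRUE,       # keep the directory part in each path
                           recursive = TRUE)
course_combined <- map_dfr(course_files, read_fun)

frpm_files <- list.files(path = "Extant_Data/Data/Raw/frpm/post12",
                         pattern = "^frpm",         # regular expression, not a glob
                         full.names = TRUE,
                         recursive = TRUE)
frpm_combined <- map_dfr(frpm_files, read_fun_frpm)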

Include pattern in list.dirs

Surely a very newbish question, but how do I include a pattern inside a list.dirs call?
For example, this list.files call
Imagery=list.files(full.names=TRUE, recursive=TRUE, pattern= "*20m*.tif$")
returns all the files that have 20m in their name and have .tif as extension.
But when I try to apply this logic to list.dirs
directories=list.dirs(full.names = TRUE, recursive=TRUE, pattern="R10m" )
I get this error:
Error in list.dirs(full.names = TRUE, recursive = TRUE, pattern = "R10m") :
unused argument (pattern = "R10m")
Hope I am not missing something obvious here.
My goal is to get the full path of all directories that have a folder named "R10m". I have a lot of folders that have many subdirectories, and most of them have a similar structure. I would like to list only those that contain this folder, and within them list all files that are tifs. I know I can get the files I need with list.files options alone, but I need the directory paths and file names later as variables.
Thank you beforehand for your time,
Best regards,
Davor
Three alternatives:
dirs <- list.dirs()
dirs <- dirs[ grepl(your_pattern, dirs) ]
or
dirs <- list.dirs()
dirs <- grep(your_pattern, dirs, value = TRUE)
or
files <- list.files(pattern = your_pattern, recursive = TRUE, include.dirs = TRUE)
dirs <- files[ file.info(files)$isdir ]
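Either way, once the matching directories are in hand, listing the tif files inside each one is a small extra step — a sketch building on the alternatives above:
r10m_dirs <- grep("R10m", list.dirs(full.names = TRUE), value = TRUE)      # directories whose path contains "R10m"
tif_files <- lapply(r10m_dirs, list.files, pattern = "\\.tif$", full.names = TRUE)
names(tif_files) <- r10m_dirs                                              # keep the directory path next to its files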
dir, unlike list.dirs, provides that functionality:
dir(path = ".", pattern = NULL, all.files = FALSE,
full.names = FALSE, recursive = FALSE,
ignore.case = FALSE, include.dirs = FALSE, no.. = FALSE)
In your example:
directories <- dir(full.names = TRUE, recursive = TRUE, include.dirs = TRUE, pattern = "R10m") # include.dirs is needed so directories are returned when recursive = TRUE
Yes, I also find it strange that there are two base functions to list directories, one of which, despite the name similarity with list.files, doesn't provide the same like-for-like functionality. If someone knows the reason for this, I would be very interested to hear it.
Update
After Gregor's comment, I decided to create a reproducible example to test my solution:
test_dirs <- c(
  paste0(c(1:3), "R10m", rep("a", 3)),
  paste0(c(1:3), "R200m", rep("a", 3))
)
for (test_dir in test_dirs) {
  dir.create(test_dir)
}
list.dirs()
[1] "."          "./1R10ma"   "./1R200ma"
[4] "./2R10ma"   "./2R200ma"  "./3R10ma"
[7] "./3R200ma"  "./solo_kit-figure"
dir()
[1] "1R10ma" "1R200ma" "2R10ma" "2R200ma"
[5] "3R10ma" "3R200ma" "a1.bed" "a2.bed"
[9] "a.bed" "solo_kit-figure" "solo_kit.md"
dir(pattern = "R10m")
# dir(pattern = "*R10m")
# also works
"1R10ma" "2R10ma" "3R10ma"
dir also lists files, so if the pattern fits both files and directories it might be a problem, but I guess that for most applications it will work fine.

r looping through folders and searching for file and then concatenate data

I have a base folder and it has many folders in it. I want to go into each folder, find a file named table_amzn.csv (if it exists), read all of those files into R, and put them into a single dataframe one after the other. I have verified that all files have the same columns. I know how to read CSVs into R, but how could I loop over all the folders within a base folder and concatenate the data?
This also can be straightforward in base R:
## change `dir` to whatever your 'base folder' actually is
dir <- '~/base_folder'
ff <- list.files(dir, pattern = "table_amzn.csv", recursive = TRUE, full.names = TRUE)
out <- do.call(rbind, lapply(ff, read.csv))
In the event that your columns are the same but for whatever reason (a typo, etc.) have different column names, you could modify the above like so:
out <- do.call(rbind, lapply(ff, read.csv, header = FALSE, skip = 1))
names(out) <- c('stub1', 'stub2') # whatever they should be
Here is an implementation that was recently added to the package rio:
files <- list.files(pattern = "table_amzn.csv", recursive = TRUE, full.names = TRUE)
devtools::install_github("leeper/rio")
library(rio)
df <- import_list(files, rbind = TRUE)
This will load all the objects in files into a single data.frame object. Alternatively, if you call it with rbind = FALSE, a list of data.frames is returned.
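For example, the list form can still be combined later while keeping track of which element each row came from — a short sketch, not from the original answer:
dfs <- import_list(files, rbind = FALSE)                    # one data.frame per file
combined <- data.table::rbindlist(dfs, idcol = "source")    # bind rows, recording the list element in "source"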

How do you read multiple .txt files into R? [duplicate]

This question already has answers here:
How to import multiple .csv files at once?
(15 answers)
Closed 4 years ago.
I'm using R to visualize some data, all of which is in .txt format. There are a few hundred files in a directory and I want to load them all into one table, in one shot.
Any help?
EDIT:
Listing the files is not a problem. But I am having trouble going from the list to the content. I've tried some of the code from here, but I get an error with this part:
all.the.data <- lapply( all.the.files, txt , header=TRUE)
saying
Error in match.fun(FUN) : object 'txt' not found
Any snippets of code that would clarify this problem would be greatly appreciated.
You can try this:
filelist = list.files(pattern = ".*.txt")
#assuming tab separated values with a header
datalist = lapply(filelist, function(x)read.table(x, header=T))
#assuming the same header/columns for all files
datafr = do.call("rbind", datalist)
There are three fast ways to read multiple files and put them into a single data frame or data table.
First, get the list of all txt files (including those in sub-folders):
list_of_files <- list.files(path = ".", recursive = TRUE,
pattern = "\\.txt$",
full.names = TRUE)
1) Use fread() w/ rbindlist() from the data.table package
#install.packages("data.table", repos = "https://cran.rstudio.com")
library(data.table)
# Read all the files and create a FileName column to store filenames
DT <- rbindlist(sapply(list_of_files, fread, simplify = FALSE),
use.names = TRUE, idcol = "FileName")
2) Use readr::read_table2() w/ purrr::map_df() from the tidyverse framework:
#install.packages("tidyverse",
# dependencies = TRUE, repos = "https://cran.rstudio.com")
library(tidyverse)
# Read all the files and create a FileName column to store filenames
df <- list_of_files %>%
set_names(.) %>%
map_df(read_table2, .id = "FileName")
3) (Probably the fastest out of the three) Use vroom::vroom():
#install.packages("vroom",
# dependencies = TRUE, repos = "https://cran.rstudio.com")
library(vroom)
# Read all the files and create a FileName column to store filenames
df <- vroom(list_of_files, id = "FileName")
Note: to clean up file names, use basename or gsub functions
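For instance, stripping the directory and the extension from the id column might look like this — a sketch assuming the data.table result DT from option 1:
DT[, FileName := tools::file_path_sans_ext(basename(FileName))]   # "./sub/a.txt" becomes "a"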
Benchmark: readr vs data.table vs vroom for big data
Edit 1: to read multiple csv files and skip the header using readr::read_csv
list_of_files <- list.files(path = ".", recursive = TRUE,
pattern = "\\.csv$",
full.names = TRUE)
df <- list_of_files %>%
purrr::set_names(nm = (basename(.) %>% tools::file_path_sans_ext())) %>%
purrr::map_df(read_csv,
col_names = FALSE,
skip = 1,
.id = "FileName")
Edit 2: to convert a pattern including a wildcard into the equivalent regular expression, use glob2rx()
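For example, the earlier call could be written with a wildcard pattern like this (just a sketch of that usage):
list_of_files <- list.files(path = ".", recursive = TRUE,
                            pattern = glob2rx("*.txt"),   # glob translated to a regular expression
                            full.names = TRUE)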
There is a really, really easy way to do this now: the readtext package.
readtext::readtext("path_to/your_files/*.txt")
It really is that easy.
Look at the help for the functions dir() aka list.files(). This allows you to get a list of files, possibly filtered by regular expressions, over which you could loop.
If you want them all at once, you first have to get the content into one stream. One option would be to use cat to write all files to stdout and read that via a pipe() connection. See help(connections) for more.
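A minimal sketch of that idea on a Unix-alike, assuming the files share one tab-separated layout and have no header rows (repeated headers would otherwise end up as data):
con <- pipe("cat *.txt")                          # concatenate every .txt file via the shell
all_data <- read.table(con, header = FALSE, sep = "\t")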
Thanks for all the answers!
In the meantime, I also hacked together a method of my own. Let me know if it is of any use:
library(foreign)
setwd("/path/to/directory")
files <- list.files()
data <- character(0)   # start empty instead of with a stray 0
for (f in files) {
  tempData <- scan(f, what = "character")
  data <- c(data, tempData)
}
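The same concatenation can be written without growing the vector inside a loop — an equivalent one-liner, just as a sketch:
data <- unlist(lapply(files, scan, what = "character"))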
