I have a folder with 1000 .txt files, with names like file1.txt, file2.txt, ..., file1000.txt.
I want to extract a variable that is present in all the files. The problem is that when reading the files, R orders them lexically: file1, file10, file11, ..., file1000, then file2, ..., file299, and so on. How can I make the program read the files in numeric order (i.e. 1, 2, 3, ..., 1000), so that it is easy to match the extracted variable with the file number? I am using this piece of code:
library(data.table) # rbindlist & fread
list_of_files <- list.files(path = ".", recursive = TRUE,
                            pattern = "\\.txt$",
                            full.names = TRUE)
# Read all the files and create a FileName column to store filenames
DT <- rbindlist(sapply(list_of_files, fread, simplify = FALSE),
                use.names = FALSE, idcol = "FileName")
If you want to order the files numerically, sort the list with gtools::mixedsort()/mixedorder() and then read the files. Also, instead of sapply, use lapply, which returns a plain list for rbindlist():
library(gtools)
list_of_files <- list_of_files[mixedorder(basename(list_of_files))]
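Putting it together with the reading step (a sketch; setNames() keeps the file names so that idcol can record them instead of bare list positions):
library(data.table)
# read in the mixedorder()-sorted order; the names become the FileName column
DT <- rbindlist(setNames(lapply(list_of_files, fread),
                         basename(list_of_files)),
                use.names = TRUE, idcol = "FileName")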
Merge .txt files from different sub directories
I have a folder filled with subfolders named for past dates (01_14, for example); inside each date folder there are 11 files named 01.txt, 02.txt, and so on. How can I merge all the .txt files into one data frame, with one column holding the name of the folder each row came from and another holding the name of the file?
My hierarchy would look something like this:
\Data
\01_14
01.txt
02.txt
...
11.txt
\02_14
01.txt
02.txt
...
11.txt
\03_14
01.txt
02.txt
...
11.txt
When I need to read multiple files, I use a read.stack helper function, which is basically a wrapper around read.table that also lets you add extra columns on a per-file basis. Here's how I might use it in your scenario.
dir <- "Data"
subdir <- list.dirs(dir, recursive = FALSE)
# get dir/file names
ff <- do.call(rbind, lapply(subdir, function(x) {
  ff <- list.files(x, "\\.txt$", include.dirs = FALSE, full.names = TRUE)
  data.frame(dir = basename(x), file = basename(ff),
             fullpath = ff, stringsAsFactors = FALSE)
}))
# read into a data.frame
read.stack(ff$fullpath, extra = list(file = ff$file, dir = ff$dir))
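Note that read.stack() is not part of base R; it is the answerer's own helper. A minimal sketch of what such a helper might look like (an assumption for illustration, not the original code):
read.stack <- function(files, ..., extra = NULL) {
  dfs <- lapply(seq_along(files), function(i) {
    df <- read.table(files[i], ...)
    # attach each per-file extra column as a constant for that file's rows
    for (nm in names(extra)) df[[nm]] <- extra[[nm]][i]
    df
  })
  do.call(rbind, dfs)
}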
Try this:
fileNames <- list.files("Data", recursive = TRUE, full.names = TRUE)
# collapse each file's lines into a single string per file
fileContents <- lapply(fileNames, function(fileName)
  paste(readLines(fileName, warn = FALSE), collapse = "\n"))
# capture the folder and file name from each path
meta <- regmatches(fileNames, regexec(".*Data/(.*)/(.*)$", fileNames))
# pair each file's content with its folder/file metadata
merged <- mapply(c, fileContents, lapply(meta, "[", -1), SIMPLIFY = FALSE)
as.data.frame(t(do.call(cbind, merged)))
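The three resulting columns come out unnamed (the file text, then the two captured path parts); giving them names is a small finishing touch (the names here are just illustrative):
res <- as.data.frame(t(do.call(cbind, merged)))
names(res) <- c("content", "dir", "file")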
Consider a folder 'C:/ZFILE' that contains many zip files.
Now, consider that each of these zips contains many csv files, among which is one specific csv named 'NAME.CSV'; all these scattered 'NAME.CSV' files are identically named and structured (i.e., same columns).
How can I rbind all these scattered csv files?
The script below does it, but a function would be more appropriate.
How can this be done?
Thanks
zfile <- "C:/ZFILE"
zlist <- list.files(path = zfile, pattern = "\\.zip$", recursive = FALSE, full.names = TRUE)
zlist # list all zips in the zfile folder
zunzip <- lapply(zlist, unzip, exdir = zfile) # unzip every zip into the zfile folder (may take time, depending on the number of zips)
library(data.table) # rbindlist & fread
csv_name <- "NAME.CSV"
csv_list <- list.files(path = zfile, pattern = paste0(gsub(".", "\\.", csv_name, fixed = TRUE), "$"), recursive = TRUE, ignore.case = FALSE, full.names = TRUE)
csv_list # list all 'NAME.CSV' files under the zfile folder
csv_rbind <- rbindlist(sapply(csv_list, fread, simplify = FALSE), idcol = "filename")
You can try this type of function (you can pass the unzip call directly to the cmd parameter of data.table::fread()):
get_zipped_csv <- function(path) {
  # restrict to the zip archives in the folder; requires data.table
  fnames <- list.files(path, pattern = "\\.zip$", full.names = TRUE)
  rbindlist(lapply(fnames, \(f) fread(cmd = paste0("unzip -p ", f))[, src := f]))
}
Usage:
get_zipped_csv(path = "C:/ZFILE")
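Since the question targets one specific member, NAME.CSV, inside each archive, the unzip call can name that member explicitly so nothing else is read; a hedged variant of the same idea (assuming a command-line unzip is available on the PATH):
library(data.table) # rbindlist & fread
get_named_csv <- function(path, member = "NAME.CSV") {
  zips <- list.files(path, pattern = "\\.zip$", full.names = TRUE)
  # `unzip -p <archive> <member>` streams just that member to stdout
  rbindlist(lapply(zips, function(f)
    fread(cmd = paste("unzip -p", shQuote(f), shQuote(member)))[, src := f]))
}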
I have a base folder with many folders in it. I want to go into each folder, find a file named table_amzn.csv (if it exists), read all of those files into R, and stack them in a single data frame, one after another. I have verified that all the files have the same columns. I know how to read CSVs into R, but how can I loop over all the folders within the base folder and concatenate the data?
This can also be done in straightforward base R:
## change `dir` to whatever your 'base folder' actually is
dir <- '~/base_folder'
ff <- list.files(dir, pattern = "table_amzn\\.csv$", recursive = TRUE, full.names = TRUE)
out <- do.call(rbind, lapply(ff, read.csv))
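If you also want to record which folder each file came from, a small extension of the same idea (a sketch):
out <- do.call(rbind, lapply(ff, function(f) {
  df <- read.csv(f)
  df$folder <- basename(dirname(f)) # name of the enclosing folder
  df
}))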
In the event that your columns are the same but for whatever reason (a typo, etc.) have different column names, you could modify the above like this:
out <- do.call(rbind, lapply(ff, read.csv, header = FALSE, skip = 1))
names(out) <- c('stub1', 'stub2') # whatever they should be
Here is an implementation that was recently added to the package rio:
files <- list.files(pattern = "table_amzn\\.csv$", recursive = TRUE, full.names = TRUE)
devtools::install_github("leeper/rio")
library(rio)
df <- import_list(files, rbind = TRUE)
This loads all the files listed in files into a single data.frame object. Alternatively, if you call it with rbind = FALSE, a list of data.frames is returned.
I'm using R to visualize some data, all of which is in .txt format. There are a few hundred files in a directory, and I want to load them all into one table in one shot.
Any help?
EDIT:
Listing the files is not a problem, but I am having trouble going from the list of files to their contents. I've tried some of the code from here, but I get an error with this part:
all.the.data <- lapply( all.the.files, txt , header=TRUE)
saying
Error in match.fun(FUN) : object 'txt' not found
Any snippets of code that would clarify this problem would be greatly appreciated.
You can try this:
filelist = list.files(pattern = "\\.txt$")
# assuming tab-separated values with a header
datalist = lapply(filelist, function(x) read.table(x, header = TRUE))
# assuming the same header/columns for all files
datafr = do.call("rbind", datalist)
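If you also need to track which file each row came from, a hedged extension of the same pattern:
datalist = lapply(filelist, function(x) {
  dat <- read.table(x, header = TRUE)
  dat$source <- x # record the originating file
  dat
})
datafr = do.call("rbind", datalist)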
There are three fast ways to read multiple files and put them into a single data frame or data table.
First, get the list of all txt files (including those in sub-folders):
list_of_files <- list.files(path = ".", recursive = TRUE,
                            pattern = "\\.txt$",
                            full.names = TRUE)
1) Use fread() w/ rbindlist() from the data.table package
#install.packages("data.table", repos = "https://cran.rstudio.com")
library(data.table)
# Read all the files and create a FileName column to store filenames
DT <- rbindlist(sapply(list_of_files, fread, simplify = FALSE),
                use.names = TRUE, idcol = "FileName")
2) Use readr::read_table2() w/ purrr::map_df() from the tidyverse framework:
#install.packages("tidyverse",
# dependencies = TRUE, repos = "https://cran.rstudio.com")
library(tidyverse)
# Read all the files and create a FileName column to store filenames
df <- list_of_files %>%
  set_names(.) %>%
  map_df(read_table2, .id = "FileName")
3) (Probably the fastest out of the three) Use vroom::vroom():
#install.packages("vroom",
# dependencies = TRUE, repos = "https://cran.rstudio.com")
library(vroom)
# Read all the files and create a FileName column to store filenames
df <- vroom(list_of_files, id = "FileName")
Note: to clean up the file names, use the basename() or gsub() functions.
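For example, on the data.table result from option 1 (a sketch):
# strip the directories and the .txt extension from the id column
DT[, FileName := tools::file_path_sans_ext(basename(FileName))]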
Benchmark: readr vs data.table vs vroom for big data
Edit 1: to read multiple csv files and skip the header using readr::read_csv:
list_of_files <- list.files(path = ".", recursive = TRUE,
                            pattern = "\\.csv$",
                            full.names = TRUE)
df <- list_of_files %>%
  purrr::set_names(nm = (basename(.) %>% tools::file_path_sans_ext())) %>%
  purrr::map_df(read_csv,
                col_names = FALSE,
                skip = 1,
                .id = "FileName")
Edit 2: to convert a pattern including a wildcard into the equivalent regular expression, use glob2rx()
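For example:
glob2rx("*.txt")
# [1] "^.*\\.txt$"
# usable directly as a pattern
list.files(pattern = glob2rx("*.txt"))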
There is a really, really easy way to do this now: the readtext package.
readtext::readtext("path_to/your_files/*.txt")
It really is that easy.
Look at the help for the function dir(), aka list.files(). It allows you to get a list of files, possibly filtered by regular expressions, over which you can loop.
If you want to read them all at once, you first have to get the content into one place. One option would be to use cat to write all the files to stdout and read that stream using popen(). See help(connections) for more.
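A minimal sketch of that idea, assuming a Unix-like system where cat is available (R exposes popen() through pipe()):
con <- pipe("cat *.txt") # stream every .txt file through one connection
all_data <- read.table(con, header = FALSE) # assuming no per-file headers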
Thanks for all the answers!
In the meantime, I also hacked together a method of my own. Let me know if it is useful:
library(foreign)
setwd("/path/to/directory")
files <- list.files()
data <- character(0) # start with an empty vector, not a stray 0
for (f in files) {
  tempData <- scan(f, what = "character")
  data <- c(data, tempData) # note: growing a vector in a loop copies it each time
}