This question already has answers here:
How to import multiple .csv files at once?
(15 answers)
Closed 4 years ago.
I'm using R to visualize some data, all of which is in .txt format. There are a few hundred files in a directory and I want to load them all into one table in one shot.
Any help?
EDIT:
Listing the files is not a problem. But I am having trouble going from the list to the content. I've tried some of the code from here, but I get an error with this part:
all.the.data <- lapply( all.the.files, txt , header=TRUE)
saying
Error in match.fun(FUN) : object 'txt' not found
Any snippets of code that would clarify this problem would be greatly appreciated.
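The error occurs because txt is not the name of a function: lapply() expects a function, such as read.table or read.delim, in that slot. A minimal fix, assuming tab-delimited files, would be:
# pass a real reader function; read.delim defaults to tab-separated values
all.the.data <- lapply(all.the.files, read.delim, header = TRUE)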
You can try this:
filelist = list.files(pattern = "\\.txt$")
# assuming tab-separated values with a header
datalist = lapply(filelist, function(x) read.table(x, header = TRUE))
# assuming the same header/columns for all files
datafr = do.call("rbind", datalist)
There are three fast ways to read multiple files and put them into a single data frame or data table:
First, get the list of all txt files (including those in sub-folders):
list_of_files <- list.files(path = ".", recursive = TRUE,
pattern = "\\.txt$",
full.names = TRUE)
1) Use fread() w/ rbindlist() from the data.table package
#install.packages("data.table", repos = "https://cran.rstudio.com")
library(data.table)
# Read all the files and create a FileName column to store filenames
DT <- rbindlist(sapply(list_of_files, fread, simplify = FALSE),
use.names = TRUE, idcol = "FileName")
2) Use readr::read_table2() w/ purrr::map_df() from the tidyverse framework (note: read_table2() is superseded by read_table() as of readr 2.0):
#install.packages("tidyverse",
# dependencies = TRUE, repos = "https://cran.rstudio.com")
library(tidyverse)
# Read all the files and create a FileName column to store filenames
df <- list_of_files %>%
set_names(.) %>%
map_df(read_table2, .id = "FileName")
3) (Probably the fastest out of the three) Use vroom::vroom():
#install.packages("vroom",
# dependencies = TRUE, repos = "https://cran.rstudio.com")
library(vroom)
# Read all the files and create a FileName column to store filenames
df <- vroom(list_of_files, id = "FileName")
Note: to clean up the file names, use basename() or gsub().
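For example, assuming the DT/df objects created above:
# strip the directory part, keeping just the file name
DT[, FileName := basename(FileName)]
# or, for the data-frame variants
df$FileName <- basename(df$FileName)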
Benchmark: readr vs data.table vs vroom for big data
Edit 1: to read multiple csv files and skip the header using readr::read_csv
list_of_files <- list.files(path = ".", recursive = TRUE,
pattern = "\\.csv$",
full.names = TRUE)
df <- list_of_files %>%
purrr::set_names(nm = (basename(.) %>% tools::file_path_sans_ext())) %>%
purrr::map_df(read_csv,
col_names = FALSE,
skip = 1,
.id = "FileName")
Edit 2: to convert a pattern including a wildcard into the equivalent regular expression, use glob2rx()
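For instance (the commented line shows the regular expression glob2rx() returns):
glob2rx("*.txt")
#> [1] "^.*\\.txt$"
list_of_files <- list.files(pattern = glob2rx("*.txt"), full.names = TRUE)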
There is a really, really easy way to do this now: the readtext package.
readtext::readtext("path_to/your_files/*.txt")
It really is that easy.
Look at the help for the functions dir() aka list.files(). These let you get a list of files, possibly filtered by regular expressions, over which you can loop.
If you want to read them all at once, you first have to have the content in one stream. One option is to use cat to write all the files to stdout and read that stream back through a pipe() connection. See help(connections) for more.
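A minimal sketch of that idea on a Unix-like system (hypothetical path; assumes whitespace-separated files without headers):
# concatenate every .txt file and read the combined stream in one go
all_data <- read.table(pipe("cat /path/to/dir/*.txt"), header = FALSE)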
Thanks for all the answers!
In the meantime, I also hacked together a method of my own. Let me know if it is of any use:
library(foreign)
setwd("/path/to/directory")
files <- list.files()
data <- character(0)
for (f in files) {
  tempData <- scan(f, what = "character")
  data <- c(data, tempData)
}
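Growing a vector with c() inside a loop copies it on every pass; an equivalent vectorized sketch:
files <- list.files("/path/to/directory", full.names = TRUE)
# scan each file into a character vector, then flatten into one
data <- unlist(lapply(files, scan, what = "character"))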
Related
I have a folder with 1000 .txt files, with names like file1.txt, file2.txt, ..., file1000.txt.
I want to extract a variable that is present in all the files. The problem is that when reading the files, R reads them in the order file1, file10, file11, ..., file1000, then file2, ..., file299, and so on. How can I make it read the files in numeric order (i.e. 1, 2, 3, ..., 1000), so that it becomes easy to match the variable with the file number? I am using this piece of code:
list_of_files <- list.files(path = ".", recursive = TRUE,
pattern = "\\.txt$",
full.names = TRUE)
# Read all the files and create a FileName column to store filenames
DT <- rbindlist(sapply(list_of_files, fread, simplify = FALSE),
use.names = FALSE, idcol = "FileName")
To order the files, sort them with mixedsort/mixedorder from the gtools package and then read them. Also, use lapply instead of sapply: the resulting list is unnamed, so rbindlist's idcol becomes the file's position in the sorted order. A full sketch follows the snippet below.
library(gtools)
list_of_files <- list_of_files[mixedorder(basename(list_of_files))]
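Putting it together (a sketch, reusing list_of_files from the question):
library(gtools)
library(data.table)
# sort file1, file2, ..., file1000 numerically rather than lexically
list_of_files <- list_of_files[mixedorder(basename(list_of_files))]
# an unnamed list makes idcol the file's position (1, 2, ..., 1000)
DT <- rbindlist(lapply(list_of_files, fread), use.names = TRUE, idcol = "FileName")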
I have 30 CSVs (with huge data), each having 92 columns with headers, in a folder. I need to merge the data for only 10 specific columns from all the CSVs into a single df using R. Let's say the column names are Col1, Col2, Col3, Col4, ..., Col10. Below is my sample code, which combines all the CSVs irrespective of column names.
mypath <-"C:/Blrt/B0/Mac/Output/"
setwd(mypath)
filelist <- list.files(path=mypath, pattern="*.csv", full.names=FALSE)
filelist
Almdat <- Reduce(rbind, lapply(filelist, read.csv, header = TRUE, quote = "", sep = ",", row.names = NULL))
Any help would be appreciated.
You could try a combination of purrr and readr from the tidyverse. read_csv from readr allows you to specify col_types and provides the function cols_only, which lets you specify which columns to load and the types they should be loaded as (the example below uses col_guess(), but you can be more specific if you wish).
map_dfr from the purrr package replaces the lapply, Reduce and rbind. The result is a tibble combining the rows of all the data frames loaded.
library(tidyverse)
filelist <- list.files(path = "C:/Blrt/B0/Mac/Output/", pattern = ".csv", full.names = TRUE)
Almdat <- map_dfr(filelist,
                  read_csv,
                  col_types = cols_only(Col1 = col_guess(),
                                        Col2 = col_guess(),
                                        Col3 = col_guess()))
The example above uses only three columns; you can add as many as you like to your call to cols_only().
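For ten columns you could also build the specification programmatically instead of typing each collector (a sketch, assuming the Col1...Col10 names from the question):
# one col_guess() collector per column, named Col1 ... Col10
col_spec <- do.call(cols_only, setNames(rep(list(col_guess()), 10), paste0("Col", 1:10)))
Almdat <- map_dfr(filelist, read_csv, col_types = col_spec)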
You can try:
cols <- paste0('Col', 1:10)
Almdat <- do.call(rbind, lapply(filelist, function(x)
  read.csv(x, quote = "", row.names = NULL)[cols]))
Or using tidyverse functions:
Almdat <- purrr::map_df(filelist, ~read.csv(.x, quote = "", row.names = NULL) %>%
  dplyr::select(dplyr::all_of(cols)))
This question already has answers here:
Importing multiple .csv files into R and adding a new column with file name
(2 answers)
Closed 14 days ago.
I have numerous csv files in multiple directories that I want to read into an R tibble or data.table. I use list.files() with the recursive argument set to TRUE to create a list of file names and paths, then lapply() to read in the csv files, and then bind_rows() to stick them all together:
filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
tbl <- lapply(filenames, read_csv) %>%
bind_rows()
This approach works fine. However, I need to extract a substring from each file name and add it as a column to the final table. I can get the substring I need with str_extract() like this:
sites <- str_extract(filenames, "[A-Z]{2}-[A-Za-z0-9]{3}")
I am stuck, however, on how to add the extracted substring as a column while lapply() runs read_csv() over each file.
I generally use the following approach, based on dplyr/tidyr:
library(tidyverse)
# `files` is the vector of file paths (e.g. from list.files())
data = tibble(File = files) %>%
  extract(File, "Site", "([A-Z]{2}-[A-Za-z0-9]{3})", remove = FALSE) %>%
  mutate(Data = lapply(File, read_csv)) %>%
  unnest(Data) %>%
  select(-File)
tidyverse approach:
Update:
readr 2.0 (and later) has built-in support for reading a list of files with the same columns into one output table in a single command: just pass the filenames to be read as a single vector to the reading function. For example, reading in csv files:
library(tidyverse)
(files <- fs::dir_ls("D:/data", glob = "*.csv"))
dat <- read_csv(files, id = "path")
Alternatively using map_dfr with purrr:
Add the filename using the .id = "source" argument in purrr::map_dfr()
An example loading .csv files:
# specify the directory, then read a list of files
library(here)
data_dir <- here("file/path")
data_list <- fs::dir_ls(data_dir, regexp = ".csv$")
# return a single data frame w/ purrr::map_dfr
my_data = data_list %>%
purrr::map_dfr(read_csv, .id = "source")
# Alternatively, rename source from the file path to the file name
my_data = data_list %>%
purrr::map_dfr(read_csv, .id = "source") %>%
dplyr::mutate(source = stringr::str_replace(source, "file/path", ""))
You could use purrr::map2 here, which works similarly to mapply
filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
sites <- str_extract(filenames, "[A-Z]{2}-[A-Za-z0-9]{3}") # same length as filenames
library(purrr)
library(dplyr)
library(readr)
stopifnot(length(filenames)==length(sites)) # returns error if not the same length
ans <- map2(filenames, sites, ~read_csv(.x) %>% mutate(id = .y)) # .x is element in filenames, and .y is element in sites
The output of map2() is a list, similar to lapply().
You can also use imap(), a wrapper around map2() that uses the names (or positions) of its input as the second argument.
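For example, to collapse the map2() result into one table, or to get the same effect with imap() (a sketch under the same assumptions as above):
tbl <- bind_rows(ans)
# imap() passes each element and its name, so name the files by site first
ans2 <- imap(set_names(filenames, sites), ~read_csv(.x) %>% mutate(id = .y))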
data.table approach:
If you name the list, you can use those names to add a column to the data.table when binding the list together.
workflow
library(data.table)
files <- list.files( ... )  # supply your path/pattern arguments
# read the files from the list
l <- lapply(files, fread)
# name the list using the basename of `files`
# this is also the step where you can manipulate the filenames however you like
names(l) <- basename(files)
# bind the rows of the list together, putting the filenames into the column "id"
dt <- rbindlist(l, idcol = "id")
You just need to write your own function that reads the csv and adds the column you want, before combining them.
my_read_csv <- function(x) {
out <- read_csv(x)
site <- str_extract(x, "[A-Z]{2}-[A-Za-z0-9]{3}")
cbind(Site=site, out)
}
filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
tbl <- lapply(filenames, my_read_csv) %>% bind_rows()
You can build a file_names vector based on sites with exactly the same length as tbl, and then combine the two using cbind:
### Get file names
filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
sites <- str_extract(filenames, "[A-Z]{2}-[A-Za-z0-9]{3}")
### Get length of each csv
file_lengths <- unlist(lapply(lapply(filenames, read_csv), nrow))
### Repeat sites using lengths
file_names <- rep(sites, file_lengths)
### Create table
tbl <- lapply(filenames, read_csv) %>%
bind_rows()
### Combine file_names and tbl
tbl <- cbind(tbl, filename = file_names)
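Note that this approach reads every file twice (once for the lengths and once for the table); a variant that reads each file only once might be:
data_list <- lapply(filenames, read_csv)
# row count per file, used to repeat each site label
file_lengths <- vapply(data_list, nrow, integer(1))
tbl <- bind_rows(data_list)
tbl$filename <- rep(sites, file_lengths)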
I have a base folder with many folders in it. I want to go into each folder, find a file named table_amzn.csv (if it exists), then read all of those files into R and put them in a single dataframe one after the other. I have verified that all the files have the same columns. I know how to read CSVs into R, but how can I loop over all the folders within the base folder and concatenate the data?
This can also be straightforward in base R:
## change `dir` to whatever your 'base folder' actually is
dir <- '~/base_folder'
ff <- list.files(dir, pattern = "^table_amzn\\.csv$", recursive = TRUE, full.names = TRUE)
out <- do.call(rbind, lapply(ff, read.csv))
In the event that your columns are the same but for whatever reason (a typo, etc.) have different column names, you could modify the above like so:
out <- do.call(rbind, lapply(ff, read.csv, header = FALSE, skip = 1))
names(out) <- c('stub1', 'stub2') # whatever they should be
Here is an implementation that was recently added to the package rio:
devtools::install_github("leeper/rio")
library(rio)
files <- list.files(pattern = "table_amzn.csv", recursive = TRUE, full.names = TRUE)
df <- import_list(files, rbind = TRUE)
This will load all the objects in files into a single data.frame object. Alternatively, if you call it with rbind = FALSE, a list of data.frames is returned.