Merging multiple csv files with name sequence - r

I am trying to merge 700+ csv files in R. I was able to merge them successfully using the code:
library(dplyr)
library(readr)
df <- list.files(full.names = TRUE) %>%
  lapply(read_csv) %>%
  bind_rows()
Now my problem is that the files are named flux.0, flux.1, flux.2, ..., flux.733. R binds the files in the order flux.0, flux.1, flux.10, flux.100, flux.101, and so on. Since the sequence of the files is important to me, can you suggest how to incorporate the correct ordering into the above code?
Many thanks for the help!
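One way to get the files in numeric rather than alphabetical order (a minimal sketch, assuming the gtools package is available; its mixedsort() puts flux.2 before flux.10) is to sort the paths before reading them:
library(dplyr)
library(readr)
library(gtools)   # mixedsort() gives "natural" ordering of names that mix text and numbers

df <- list.files(full.names = TRUE) %>%
  mixedsort() %>%       # flux.0, flux.1, flux.2, ..., flux.733
  lapply(read_csv) %>%
  bind_rows()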

My pipeline for things like that is to get the list of all the files (as you did), turn it into a tbl/data.frame, and then use map() to read the files and unnest() them. That way I can keep the path/file name for each file I loaded.
library(tidyverse)

df <- list.files(path = "path",
                 full.names = TRUE,
                 recursive = TRUE,
                 pattern = "*.csv") %>%
  tibble(value = .) %>%                     # one row per file, path kept in a 'value' column
  mutate(data = map(value, read.csv)) %>%   # read each file into a list-column
  arrange(value) %>%
  unnest(data)
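Because the path is kept in the value column, you can derive a cleaner file label from it after unnesting if you need one, for example:
df <- df %>%
  mutate(file = basename(value))   # e.g. "flux.12" instead of the full path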

Here is another answer using your own approach. I've just added a function that reads the csv and adds a new column called 'file' with the name of the file without the extension.
library(dplyr)
library(readr)
df <- list.files(full.names = TRUE) %>%
  lapply(function(x) {
    a <- read_csv(x)
    mutate(a, file = tools::file_path_sans_ext(basename(x)))
  }) %>%
  bind_rows()
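One caveat for the file names in the question (flux.0, flux.1, ...): tools::file_path_sans_ext() treats the trailing number as the extension and strips it, so every row would get file = "flux". A sketch that keeps basename() instead and then orders the rows by the numeric suffix:
library(dplyr)
library(readr)

df <- list.files(full.names = TRUE) %>%
  lapply(function(x) {
    read_csv(x) %>%
      mutate(file = basename(x))               # keep the full name, e.g. "flux.12"
  }) %>%
  bind_rows() %>%
  arrange(as.integer(sub(".*\\.", "", file)))  # order by the number after the last dot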

Related

Define one column only when reading in a list of csv files?

I'm reading in a list of files and combining into a single df with:
fi <- list.files(path = 'data', pattern = '*.csv', full.names = T) %>%
  map_df(~ read_csv(paste0(.)))
But this is hitting an error:
Error: Can't combine App ID <character> and App ID <double>.
Within the directory in question there are many files to read in and combine. I'd like to know if it's possible to define the column type for the problem field, App ID, while reading in like this, e.g. something along the lines of map_df(~read_csv(paste0(.), cols(App ID = CHR)))?
We can use
library(dplyr)
library(purrr)
library(readr)

out <- list.files(path = 'data', pattern = '\\.csv$',
                  full.names = TRUE) %>%
  map_df(~ read_csv(.x) %>%
           mutate(across(everything(), as.character)) %>%  # read everything in as character first
           type.convert(as.is = TRUE))                     # then re-guess consistent column types
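If you'd rather fix just the problem column while reading, read_csv() takes a col_types argument; a minimal sketch, assuming the column really is named App ID:
library(purrr)
library(readr)

out <- list.files(path = 'data', pattern = '\\.csv$', full.names = TRUE) %>%
  map_df(~ read_csv(.x,
                    col_types = cols(`App ID` = col_character())))  # force App ID to character in every file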

How to read in datasets from multiple directories and merge all datasets in each directory in R?

I am trying to read in multiple data files from different directories, and I want to merge all files in each directory and list them into the environment. The data I am working on is NHANES.
For example, this is what I do for NHANES 1999-2000 only:
nhanes_names <- dir(here("data", "raw", "NHANES", "1999-2000"))
nhanes_directory <- here("data", "raw", "NHANES", "1999-2000", nhanes_names)

nhanes_1999to2000 <- nhanes_directory %>%
  set_names(nhanes_names) %>%
  map(read_rds) %>%
  reduce(full_join, by = "seqn")
I am wondering if there is a tidy way to apply the code to all folders (1999-2000, 2001-2002,..., 2013-2014) using purrr.
You could do something along the lines of:
library(purrr)
library(here)
library(readr)
library(dplyr)
list.files(here("data", "raw", "NHANES"),
           recursive = TRUE, full.names = TRUE) %>%  ## get all files
  grep("\\.rds$", ., value = TRUE) %>%               ## reduce to rds files
  map(read_rds) %>%                                  ## read in each
  reduce(full_join, by = "seqn")                     ## and merge them
Return merged files per directory in a list:
library(purrr)
library(here)
library(readr)
library(dplyr)
list.dirs(here("data", "raw", "NHANES"), recursive = FALSE) %>%  ## get all subdirs
  map(function(sdir) {                                           ## map over all subdirs
    fns <- list.files(sdir, "\\.rds$", full.names = TRUE)        ## get all files in sdir
    map(fns, read_rds) %>%                                       ## read each
      reduce(full_join, by = "seqn")                             ## and merge them
  })
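To keep track of which element belongs to which survey cycle, you can name the list by its directory before mapping; a small extension of the sketch above (the nhanes_by_cycle name is just for illustration):
nhanes_by_cycle <- list.dirs(here("data", "raw", "NHANES"), recursive = FALSE) %>%
  set_names(basename(.)) %>%    ## element names like "1999-2000", "2001-2002", ...
  map(function(sdir) {
    list.files(sdir, "\\.rds$", full.names = TRUE) %>%
      map(read_rds) %>%
      reduce(full_join, by = "seqn")
  })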

How can I read multiple csvs and retain the number in the file name for each?

I have multiple csv files in a folder, none of which have a header. I want to preserve the order set out by the number at the end of the file name. The file names are "output-1.csv", "output-2.csv" and so on. Is there a way to include the file name of each csv so I know which data corresponds to which file? The answer [here][1] gets close to what I want.
library(tidyverse)
#' Load the data ----
mydata <-
  list.files(path = "C:\\Users\\Documents\\Manuscripts\\experiment1\\output",
             pattern = "*.csv") %>%
  map_df(~ read_csv(., col_names = FALSE))
mydata
You can use:
library(tidyverse)
mydata <- list.files("C:\\Users\\Documents\\Manuscripts\\experiment1\\output",
                     pattern = ".csv$", full.names = TRUE) %>%
  set_names(str_sub(basename(.), 1, -5)) %>%          # drop ".csv": "output-1", "output-2", ...
  map_dfr(read_csv, col_names = FALSE, .id = "file")  # col_names = FALSE because the files have no header
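Since the number at the end of the name is what defines the order, you can sort the combined data by it afterwards; a sketch using str_extract() (from stringr, loaded with the tidyverse) on the file column created above:
mydata <- mydata %>%
  mutate(file_num = as.integer(str_extract(file, "\\d+$"))) %>%  # "output-10" -> 10
  arrange(file_num)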

Remove column using list.files in R

So, I've got the following script
output <- list.files(pattern = "some_files.csv", recursive = TRUE) %>%
  lapply(read_csv) %>%
  bind_rows()
and it works perfectly, finding all possible csv files and making one big file, but I encountered the following problem: one csv file generates the error Error: Column `some_column` can't be converted from numeric to character. So I decided to remove this column from the dataset:
output <- list.files(pattern = "some_files.csv", recursive = TRUE) %>%
  lapply(read_csv) %>%
  subset(read_csv, select = -c(some_column)) %>%
  bind_rows()
but that generates another error:
Error in subset.default(., read_csv, select = -c(some_column)) :
'subset' must be logical
Any idea?
Try this
list.files(pattern = "some_files.csv", recursive = TRUE) %>%
  purrr::map_df(~ {
    x <- readr::read_csv(.)
    x[setdiff(names(x), "some_column")]   # keep every column except 'some_column'
  })
Instead of lapply we use map; to avoid bind_rows at the end we can use map_df or map_dfr. Instead of subset we use setdiff to drop the column, since subset would fail if the column is not present in a file.
Or keeping everything in base R
file_names <- list.files(pattern = "some_files.csv", recursive = TRUE)

do.call(rbind, lapply(file_names, function(x) {
  df <- read.csv(x)
  df[setdiff(names(df), "some_column")]
}))
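If you would rather keep some_column than drop it, another option (a sketch, assuming that column name) is to force it to character while reading, so the files combine without the type clash:
library(dplyr)
library(purrr)
library(readr)

output <- list.files(pattern = "some_files.csv", recursive = TRUE) %>%
  map_df(~ read_csv(.x, col_types = cols(some_column = col_character())))  # read some_column as character everywhere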

Saving filenames of a list of imported files into a column of a data frame

I'm reading and combining a large group of csv tables into R, but before merging them all I'd like to create a column with the name of the file that each set of rows comes from.
Here is an example of the code I wrote to read the list of files:
archivos <- list.files("proyecciones", full.names = TRUE)
# proyecciones is the folder where all the csv files are located.
tbl <- lapply(archivos, read.table, sep = "", header = TRUE) %>% bind_rows()
As you can see, I already have the names of the files in "archivos", but I still haven't been able to figure out how to get them into the lapply call.
Thanks!
We need to use the .id argument of bind_rows:
library(dplyr)
library(purrr)   # for set_names()

lapply(archivos, read.table, sep = "", header = TRUE) %>%
  set_names(archivos) %>%
  bind_rows(.id = 'grp')
A more tidyverse syntax would be
library(tidyverse)

map(archivos, read.table, sep = '', header = TRUE) %>%
  set_names(archivos) %>%
  bind_rows(.id = 'grp')
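Since archivos holds paths like "proyecciones/file.csv", the grp column will contain the full path; if you only want the bare file name without its extension, you can clean it up afterwards, for example:
library(tidyverse)

map(archivos, read.table, sep = '', header = TRUE) %>%
  set_names(archivos) %>%
  bind_rows(.id = 'grp') %>%
  mutate(grp = tools::file_path_sans_ext(basename(grp)))  # "proyecciones/foo.csv" -> "foo"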
