I am trying to merge various .csv files into one dataframe using the following:
df <- list.files(path = "C:/Users...", pattern = "*.csv", full.names = TRUE) %>% lapply(read_csv) %>% bind_rows()
However, I get an error saying that I can't combine X character variable and X double variable.
Is there a way I can transform one of them to a character or double variable?
Since each csv file is slightly different, from my beginner's standpoint I believe that lapply would be the best option in this case, unless there is an easier way to get around this.
Thank you everyone for your time and attention!
You can change the X variable to character in all the files. You can also use map_df to combine all the files into one dataframe.
library(tidyverse)
result <- list.files(path = "C:/Users...", pattern = "*.csv", full.names = TRUE) %>%
map_df(~read_csv(.x) %>% mutate(X = as.character(X)))
If more columns have type-mismatch issues, you can change all the columns to character, combine the data, and use type_convert to restore their classes.
result <- list.files(path = "C:/Users...", pattern = "*.csv", full.names = TRUE) %>%
map_df(~read_csv(.x) %>% mutate(across(everything(), as.character))) %>%
type_convert()
If all the files have the same number of columns, then try plyr::rbind.fill instead of dplyr::bind_rows.
list.files(path = "C:/Users...", pattern = "*.csv", full.names = TRUE) %>% lapply(read_csv) %>% plyr::rbind.fill()
Related
I'm reading in a list of files and combining into a single df with:
fi <- list.files(path = 'data', pattern = '*.csv', full.names = T) %>%
map_df(~read_csv(paste0(.)))
But this is hitting an error:
Error: Can't combine `App ID` <character> and `App ID` <double>.
Within the directory in question there are many files to read in and combine. I'd like to know if it's possible to define the column type for the problem field, App ID, while reading in like this. E.g. something like map_df(~read_csv(.x, col_types = cols(`App ID` = col_character())))?
We can use
library(dplyr)
library(purrr)
library(readr)
out <- list.files(path = 'data', pattern = '\\.csv$',
                  full.names = TRUE) %>%
  map_df(~read_csv(.x) %>%
           mutate(across(everything(), as.character))) %>%
  type_convert()
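To answer the column-type part of the question directly: read_csv() accepts a col_types argument, so the problem field can be coerced to character while reading, without touching the other columns. A minimal self-contained sketch (the demo files and directory are hypothetical, standing in for the real data):

```r
library(readr)
library(dplyr)
library(purrr)

# Hypothetical demo files: "App ID" parses as double in one file
# and as character in the other, reproducing the bind error.
dir <- file.path(tempdir(), "appid_demo")
dir.create(dir, showWarnings = FALSE)
write_csv(tibble(`App ID` = c(101, 102), x = 1:2), file.path(dir, "a.csv"))
write_csv(tibble(`App ID` = c("A-103", "A-104"), x = 3:4), file.path(dir, "b.csv"))

# Force only the problem column to character at read time,
# so map_df no longer sees conflicting types when binding
out <- list.files(dir, pattern = "\\.csv$", full.names = TRUE) %>%
  map_df(~read_csv(.x, col_types = cols(`App ID` = col_character())))
```

This avoids the round-trip through all-character columns when only one field is known to conflict.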
This question already has answers here:
Importing multiple .csv files into R and adding a new column with file name
(2 answers)
Closed 14 days ago.
I have numerous csv files in multiple directories that I want to read into an R tibble or data.table. I use list.files() with the recursive argument set to TRUE to create a list of file names and paths, then use lapply() to read in the csv files, and then bind_rows() to stick them all together:
filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
tbl <- lapply(filenames, read_csv) %>%
bind_rows()
This approach works fine. However, I need to extract a substring from each file name and add it as a column to the final table. I can get the substring I need with str_extract() like this:
sites <- str_extract(filenames, "[A-Z]{2}-[A-Za-z0-9]{3}")
I am stuck, however, on how to add the extracted substring as a column while lapply() runs read_csv() for each file.
I generally use the following approach, based on dplyr/tidyr:
data = tibble(File = files) %>%
extract(File, "Site", "([A-Z]{2}-[A-Za-z0-9]{3})", remove = FALSE) %>%
mutate(Data = lapply(File, read_csv)) %>%
unnest(Data) %>%
select(-File)
tidyverse approach:
Update:
readr 2.0 (and beyond) has built-in support for reading a list of files with the same columns into one output table in a single command: just pass the vector of filenames to the reading function. For example, reading in csv files:
(files <- fs::dir_ls("D:/data", glob="*.csv"))
dat <- read_csv(files, id="path")
Alternatively using map_dfr with purrr:
Add the filename using the .id = "source" argument in purrr::map_dfr()
An example loading .csv files:
# specify the directory, then read a list of files
data_dir <- here("file/path")
data_list <- fs::dir_ls(data_dir, regexp = ".csv$")
# return a single data frame w/ purrr::map_dfr
my_data = data_list %>%
purrr::map_dfr(read_csv, .id = "source")
# Alternatively, rename source from the file path to the file name
my_data = data_list %>%
purrr::map_dfr(read_csv, .id = "source") %>%
dplyr::mutate(source = stringr::str_replace(source, "file/path", ""))
You could use purrr::map2 here, which works similarly to mapply
filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
sites <- str_extract(filenames, "[A-Z]{2}-[A-Za-z0-9]{3}") # same length as filenames
library(purrr)
library(dplyr)
library(readr)
stopifnot(length(filenames)==length(sites)) # returns error if not the same length
ans <- map2(filenames, sites, ~read_csv(.x) %>% mutate(id = .y)) # .x is element in filenames, and .y is element in sites
The output of map2 is a list, similar to lapply
In newer versions of purrr you can use imap, which is a wrapper for map2 over a vector and its names (or positional indices)
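As a concrete sketch of the imap variant (file names here are hypothetical): naming the filenames vector by its extracted site codes lets imap pass each name in as .y, exactly like the map2 call above.

```r
library(readr)
library(dplyr)
library(purrr)

# Hypothetical demo files whose names carry a site code
dir <- file.path(tempdir(), "imap_demo")
dir.create(dir, showWarnings = FALSE)
write_csv(tibble(val = 1:2), file.path(dir, "AB-x01_data.csv"))
write_csv(tibble(val = 3:4), file.path(dir, "CD-y02_data.csv"))

filenames <- list.files(dir, full.names = TRUE, pattern = "\\.csv$")
# name each path by its site code; imap then supplies the name as .y
names(filenames) <- stringr::str_extract(filenames, "[A-Z]{2}-[A-Za-z0-9]{3}")
ans <- imap(filenames, ~read_csv(.x) %>% mutate(id = .y))
```

Like map2, imap returns a named list of data frames, ready for bind_rows().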
data.table approach:
If you name the list, then you can use this name to add to the data.table when binding the list together.
workflow
files <- list.files( whatever... )
# read the files from the list
l <- lapply(files, fread)
# name the list using the basename of the files
# this is also the step where you can manipulate the filenames to whatever you like
names(l) <- basename(files)
# bind the rows from the list together, putting the filenames into the column "id"
dt <- rbindlist(l, idcol = "id")
You just need to write your own function that reads the csv and adds the column you want, before combining them.
my_read_csv <- function(x) {
out <- read_csv(x)
site <- str_extract(x, "[A-Z]{2}-[A-Za-z0-9]{3}")
cbind(Site=site, out)
}
filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
tbl <- lapply(filenames, my_read_csv) %>% bind_rows()
You can build a file_names vector based on sites with the exact same length as tbl, and then combine the two using cbind:
### Get file names
filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
sites <- str_extract(filenames, "[A-Z]{2}-[A-Za-z0-9]{3}")
### Get length of each csv
file_lengths <- unlist(lapply(lapply(filenames, read_csv), nrow))
### Repeat sites using lengths
file_names <- rep(sites, file_lengths)
### Create table
tbl <- lapply(filenames, read_csv) %>%
bind_rows()
### Combine file_names and tbl
tbl <- cbind(tbl, filename = file_names)
This is a tricky dplyr & purrr question
I want to simplify the following code into one dplyr pipe:
filenames <- list.files(path = data.location, pattern = "*.csv") %>%
map_chr(function(name) gsub(paste0('(.*).csv'), '\\1', name))
files.raw <- list.files(path = data.location, pattern = "*.csv", full.names = TRUE) %>%
map(read_csv) %>%
setNames(filenames)
I tried this solution, but it failed because read_csv() needs the full paths (full.names = TRUE), while I want to assign the names without the full path.
In other words, this worked - but only with full path in filenames:
list.files(path = data.location, pattern = "*.csv", full.names = TRUE) %>%
{ . ->> filenames } %>%
map(read_csv) %>%
setNames(filenames)
but this didn't:
list.files(path = data.location, pattern = "*.csv", full.names = TRUE) %>%
{ map_chr(., function(name) gsub(paste0(data.location, '/(.*).csv'), '\\1', name)) ->> filenames } %>%
map(read_csv) %>%
setNames(filenames)
Is there a way to make the map_chr work with the save (->> filenames), or is there an even simpler way to completely avoid saving to a temporary variable (filenames)?
To do it in one pipeline without intermediate variables, and similar to #Ronak Shah's answer, why not set the names first, then read in the CSVs? Ronak nests the setNames call, but it can be put in the pipeline to make it more readable:
library(tidyverse)
list.files(path = data.location, pattern = "*.csv", full.names = TRUE) %>%
setNames(., sub("\\.csv$", "", basename(.))) %>%
map(read_csv)
We can do this with only tidyverse functions
library(readr)
library(purrr)
library(dplyr)
all_files <- list.files(path = data.location, pattern = "*\\.csv", full.names = TRUE)
map(all_files, read_csv) %>%
set_names(str_remove(basename(all_files), "\\.csv$"))
Try using this method :
all_files <- list.files(path = data.location, pattern = "*.csv", full.names = TRUE)
purrr::map(all_files, readr::read_csv) %>%
setNames(sub("\\.csv$", "", basename(all_files)))
Here, we first get the complete paths of all the files and read each one with read_csv. We then use basename to get only the file name, strip the ".csv" extension, and assign the names using setNames.
To do this in one-pipe, we can do
list.files(path = data.location, pattern = "*.csv", full.names = TRUE) %>%
{setNames(map(., read_csv), sub("\\.csv$", "", basename(.)))}
Inspired by
the answer below by #Ronak Shah
the intermediary assignment suggested by #G. Grothendieck here
I put together the following solution, which
combines the reading and naming of files into one dplyr pipe (my initial goal)
makes the code more intuitive to read (my always implicit goal) - important for collaborators and in general.
So here's my one-in-all dplyr pipe:
list.files(path = data.location, pattern = "*.csv", full.names = TRUE) %>%
{ filenames <<- . } %>%
map(read_csv) %>%
setNames(filenames %>% basename %>% map(~gsub(paste0('(.*).csv'), '\\1', .x)))
The only tricky part of the above solution is using the assignment operator <<-, as <- would not work there. If you want to use the latter, you can put the whole second section into braces, as also suggested by #G. Grothendieck in the above post:
list.files(path = data.location, pattern = "*.csv", full.names = TRUE) %>%
{
{ filenames <- . } %>%
map(read_csv) %>%
setNames(filenames %>% basename %>% map(~gsub(paste0('(.*).csv'), '\\1', .x)))
}
So, I've got the following script
output <- list.files(pattern = "some_files.csv", recursive = TRUE) %>%
lapply(read_csv) %>%
bind_rows
and it works perfectly, finding all possible csv files and making one big file. But I encountered the following problem: one csv file generates an error: Error: Column `some_column` can't be converted from numeric to character. So I decided to remove this column from the dataset
output <- list.files(pattern = "some_files.csv", recursive = TRUE) %>%
lapply(read_csv) %>% subset(read_csv, select = -c(some_column)) %>%
bind_rows
That generates another error
Error in subset.default(., read_csv, select = -c(some_column)) :
'subset' must be logical
Any idea?
Try this
list.files(pattern = "some_files.csv", recursive = TRUE) %>%
purrr::map_df(~{
x <- readr::read_csv(.)
x[setdiff(names(x), "some_column")]
})
Instead of lapply we use map; to avoid bind_rows at the end we can use map_df or map_dfr. Instead of subset we use setdiff to remove the column, since subset would fail if the column is not present in one of the files.
Or keeping everything in base R
file_names <- list.files(pattern = "some_files.csv", recursive = TRUE)
do.call(rbind, lapply(file_names, function(x) {
  df <- read.csv(x)
  df[setdiff(names(df), "some_column")]
}))
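If you prefer to stay in a tidyverse pipe, dplyr's any_of() selector (dplyr >= 1.0) offers the same tolerance as setdiff: it drops the column when present and is silently ignored when it isn't. A sketch with hypothetical demo files:

```r
library(readr)
library(dplyr)
library(purrr)

# Hypothetical demo files: only the first one contains some_column
dir <- file.path(tempdir(), "anyof_demo")
dir.create(dir, showWarnings = FALSE)
write_csv(tibble(keep = 1:2, some_column = c("a", "b")), file.path(dir, "f1.csv"))
write_csv(tibble(keep = 3:4), file.path(dir, "f2.csv"))

# select(-any_of(...)) never errors on a missing column
output <- list.files(dir, pattern = "\\.csv$", full.names = TRUE) %>%
  map_df(~read_csv(.x) %>% select(-any_of("some_column")))
```
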