Basically I have two Excel files with the same name, "Checklist", in two different folders (one is 2018 and the other is 2019). Checklist has one sheet per month: "January", "February", etc. Of course, all the sheets have exactly the same variables.
I would like to put all the sheets from both Excel files into the same data frame.
For now, I can gather the sheets from one Excel file with:
library(readxl)
library(tibble)
read_excel_allsheets <- function(filename, tibble = TRUE) {
  sheets <- readxl::excel_sheets(filename)
  x <- lapply(sheets, function(X) readxl::read_excel(filename, sheet = X))
  if (!tibble) x <- lapply(x, as.data.frame)
  names(x) <- sheets
  x
}
mysheets <- read_excel_allsheets("C:/Users/Thiphaine/Documents/2018/Checklist.xlsx")
library(dplyr)
mysheets <- bind_rows(mysheets, .id = "column_label")
I just don't know how to create a loop that will go through the 2018 and 2019 folders to gather all the sheets from both Excel files. The idea is also that in 2020 I will have another folder, "2020", that should be included... Any idea? Thanks
Try this:
library(dplyr)
allsheets <- list()
for (file in list.files(path = "PATH/TO/DOCUMENTS/",
                        recursive = TRUE, pattern = "\\.xlsx$", full.names = TRUE)) {
  mysheets <- read_excel_allsheets(file)
  mysheets <- bind_rows(mysheets, .id = "column_label")
  allsheets[[file]] <- mysheets
}
where PATH/TO/DOCUMENTS is probably something like "C:/Users/Thiphaine/Documents/ for you.
If you'd like, you can also vectorize it using the tidyverse approach, especially because all of your files have the same column names and you want to end up with a single data frame.
library(tidyverse)

df <- list.files(path = "your_path",
                 full.names = TRUE,
                 recursive = TRUE,
                 pattern = "\\.xlsx?$") %>%
  tibble::enframe(name = NULL) %>%   # one row per file path, in a column named "value" (tbl_df() is defunct)
  mutate(sheetName = map(value, readxl::excel_sheets)) %>%
  unnest(sheetName) %>%
  mutate(myFiles = purrr::map2(value, sheetName, function(x, y) {
    readxl::read_excel(x, sheet = y)
  })) %>%
  unnest(myFiles)
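Because the file path (value) and the sheet name (sheetName) survive the final unnest(), you can recover the year afterwards. A small follow-up sketch, assuming the parent folder name is the year, as in your layout:

df <- df %>%
  mutate(year = basename(dirname(value)))  # ".../2018/Checklist.xlsx" -> "2018"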
I discovered R a couple of years ago and it has been very handy for cleaning up data frames, preparing data and handling other basic tasks.
Now I would like to use R to apply basic treatments to many different files stored in different folders at once.
Here is the script I would like to turn into one function that loops through my folders "dataset_2006" and "dataset_2007" and does all the work.
library(dplyr)
library(readr)
library(sf)
library(purrr)
setwd("C:/Users/Downloads/global_data/dataset_2006")
shp2006 <- list.files(pattern = 'data_2006.*\\.shp$', full.names = TRUE)
listOfShp <- lapply(shp2006, st_read)
combinedShp <- do.call(what = sf:::rbind.sf, args=listOfShp)
#import and merge CSV files into one data frame
folderfiles <- list.files(pattern = 'csv_2006_.*\\.csv$', full.names = TRUE)
csv_data <- folderfiles %>%
set_names() %>%
map_dfr(.f = read_delim,
delim = ";",
.id = "file_name")
new_shp_2006 <- merge(combinedShp, csv_data , by = "ID") %>% filter(label %in% c("AR45T", "GK879"))
st_write(new_shp_2006 , "new_shp_2006.shp", overwrite = TRUE)
setwd("C:/Users/Downloads/global_data/dataset_2007")
shp2007 <- list.files(pattern = 'data_2007.*\\.shp$', full.names = TRUE)
listOfShp <- lapply(shp2007, st_read)
combinedShp <- do.call(what = sf:::rbind.sf, args=listOfShp)
#import and merge CSV files into one data frame
folderfiles <- list.files(pattern = 'csv_2007_.*\\.csv$', full.names = TRUE)
csv_data <- folderfiles %>%
set_names() %>%
map_dfr(.f = read_delim,
delim = ";",
.id = "file_name")
new_shp_2007 <- merge(combinedShp, csv_data , by = "ID") %>% filter(label %in% c("AR45T", "GK879"))
st_write(new_shp_2007 , "new_shp_2007.shp", overwrite = TRUE)
This is easy to achieve with a for-loop over multiple items. To allow wildcards, we can also add the function Sys.glob():
myfunction <- function(directories) {
  for (dir in Sys.glob(directories)) {
    # do something with a single dir
    print(dir)
  }
}
# you can specify multiple directories manually:
myfunction(c('C:/Users/Downloads/global_data/dataset_2006',
             'C:/Users/Downloads/global_data/dataset_2007'))

# or use a wildcard to automatically get all files/directories that match the pattern:
myfunction('C:/Users/Downloads/global_data/dataset_200*')
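To connect this back to the script above, the loop body might look roughly like the sketch below. It is only an outline under two assumptions: every folder follows the same data_YYYY*.shp / csv_YYYY_*.csv naming, and the year can be read off the folder name.

library(dplyr)
library(readr)
library(purrr)
library(sf)

process_dataset <- function(dir) {
  # extract the year from the folder name, e.g. "dataset_2006" -> "2006"
  year <- sub(".*_(\\d{4})$", "\\1", basename(dir))

  # combine all shapefiles for that year
  shp_files <- list.files(dir, pattern = paste0("data_", year, ".*\\.shp$"),
                          full.names = TRUE)
  combinedShp <- do.call(what = sf:::rbind.sf, args = lapply(shp_files, st_read))

  # import and merge the CSV files for that year
  csv_files <- list.files(dir, pattern = paste0("csv_", year, "_.*\\.csv$"),
                          full.names = TRUE)
  csv_data <- csv_files %>%
    set_names() %>%
    map_dfr(.f = read_delim, delim = ";", .id = "file_name")

  new_shp <- merge(combinedShp, csv_data, by = "ID") %>%
    filter(label %in% c("AR45T", "GK879"))
  st_write(new_shp, file.path(dir, paste0("new_shp_", year, ".shp")),
           append = FALSE)
}

for (dir in Sys.glob("C:/Users/Downloads/global_data/dataset_200*")) {
  process_dataset(dir)
}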
I currently use the "range" option to only import the first 10 columns, then I keep the columns I need.
df <- read_excel("workbook1.xlsx", sheet = "SHEET1", range = cell_cols(1:10))
Is there a way to import by column name?
EDIT: I'm importing the columns from about 70 different workbooks, and the columns are in slightly different locations in each one, but have the same names.
Perhaps this is an approach that would work. It reads each full Excel file, subsets to the columns you want after reading, and row-binds the results into one big data frame, with a file column denoting which file each row came from.
library(tidyverse)
library(readxl)

list.files(path = 'path_to_files', full.names = TRUE) %>%
  set_names() %>%
  map_dfr(.id = 'file', .f = function(x) {
    read_excel(x, sheet = "SHEET1") %>%
      select(Columns, You, Want, To, Select)
  })
Or the base R equivalent (readxl aside):
library(readxl)

files_to_read <- list.files(path = 'path_to_files', full.names = TRUE)
columns_to_select <- c('c1', 'c2')

list_of_dfs <- lapply(files_to_read, function(x) {
  read_excel(x, sheet = "SHEET1")[columns_to_select]
})
one_df <- do.call('rbind', list_of_dfs)
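Since your EDIT says the columns sit in slightly different positions across the ~70 workbooks, selecting by name is the right move. If some workbooks might lack a column entirely, dplyr::any_of() (my addition, not in the answer above) keeps whichever of the named columns exist instead of erroring:

library(readxl)
library(tidyverse)

wanted <- c('c1', 'c2')

df <- list.files(path = 'path_to_files', full.names = TRUE) %>%
  set_names() %>%
  map_dfr(.id = 'file', ~ read_excel(.x, sheet = "SHEET1") %>%
            select(any_of(wanted)))  # silently skips columns a workbook doesn't have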
I want to consolidate 7 Excel files with several sheets in R. All sheets have the same structure. I tried to use a for loop, but the result is the last workbook, or an error. The code is:
files <- list.files(pattern = ".xlsx")
sheets <- excel_sheets(files)
library(xlsx)
setwd("C:/Users/User/Documents")

for (i in 1:7) {
  file <- files
  vari <- sheets %>%
    set_names() %>%
    map_df(~ read_excel(path, skip = 5, sheet = .x), .id = "sheet")
}
Thanks...
I will offer this solution. You can compose a data.frame with a list of the files and the sheets in each of them, then use map2_dfr().
library(tidyverse)
library(readxl)

setwd("C:/tmp")
path <- list.files(".", pattern = "\\.xlsx$", full.names = TRUE)

# one row per (file, sheet) combination
df_files <- data.frame(files = path) %>%
  rowwise() %>%
  mutate(sheets = list(excel_sheets(files))) %>%
  unnest(sheets)

df <- map2_dfr(.x = df_files$files, .y = df_files$sheets, readxl::read_xlsx)
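If your sheets need the header rows skipped, as the skip = 5 in your loop suggests, you can pass that through with an anonymous function (a small variant of the same line):

df <- map2_dfr(.x = df_files$files, .y = df_files$sheets,
               ~ readxl::read_xlsx(.x, sheet = .y, skip = 5))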
I have numerous csv files in multiple directories that I want to read into an R tibble or data.table. I use "list.files()" with the recursive argument set to TRUE to create a list of file names and paths, then use "lapply()" to read in the csv files, and then "bind_rows()" to stick them all together:
filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
tbl <- lapply(filenames, read_csv) %>%
  bind_rows()
This approach works fine. However, I need to extract a substring from each file name and add it as a column to the final table. I can get the substring I need with "str_extract()" like this:
sites <- str_extract(filenames, "[A-Z]{2}-[A-Za-z0-9]{3}")
I am stuck, however, on how to add the extracted substring as a column as "lapply()" runs through "read_csv()" for each file.
I generally use the following approach, based on dplyr/tidyr:
library(tidyverse)

data = tibble(File = filenames) %>%
  extract(File, "Site", "([A-Z]{2}-[A-Za-z0-9]{3})", remove = FALSE) %>%
  mutate(Data = lapply(File, read_csv)) %>%
  unnest(Data) %>%
  select(-File)
tidyverse approach:
Update:
readr 2.0 (and beyond) has built-in support for reading a list of files with the same columns into one output table in a single command: just pass the vector of filenames to the reading function. For example, reading in csv files:
library(readr)

(files <- fs::dir_ls("D:/data", glob = "*.csv"))
dat <- read_csv(files, id = "path")
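The path column then carries each source file name, so, to tie this back to the question, the site substring can be pulled out afterwards (a short follow-up sketch):

library(dplyr)
library(stringr)

dat <- dat %>%
  mutate(site = str_extract(path, "[A-Z]{2}-[A-Za-z0-9]{3}"))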
Alternatively using map_dfr with purrr:
Add the filename using the .id = "source" argument in purrr::map_dfr()
An example loading .csv files:
library(here)
library(readr)
library(dplyr)

# specify the directory, then read a list of files
data_dir <- here("file/path")
data_list <- fs::dir_ls(data_dir, regexp = "\\.csv$")

# return a single data frame w/ purrr::map_dfr
my_data <- data_list %>%
  purrr::map_dfr(read_csv, .id = "source")

# alternatively, rename source from the file path to the file name
my_data <- data_list %>%
  purrr::map_dfr(read_csv, .id = "source") %>%
  dplyr::mutate(source = stringr::str_replace(source, "file/path", ""))
You could use purrr::map2 here, which works similarly to mapply:
library(purrr)
library(dplyr)
library(readr)
library(stringr)

filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
sites <- str_extract(filenames, "[A-Z]{2}-[A-Za-z0-9]{3}")  # same length as filenames

stopifnot(length(filenames) == length(sites))  # errors if not the same length
ans <- map2(filenames, sites, ~ read_csv(.x) %>% mutate(id = .y))  # .x is the filename, .y is the matching site
The output of map2 is a list, similar to lapply's.
Recent versions of purrr also provide imap, which is a wrapper for map2 with an index.
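For example (a sketch along the same lines; naming the vector first is what lets imap supply each name as .y):

# name each file path with its site; imap passes the element as .x and its name as .y
ans <- imap(set_names(filenames, sites),
            ~ read_csv(.x) %>% mutate(id = .y))
tbl <- bind_rows(ans)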
data.table approach:
If you name the list, you can use those names to add a column to the data.table when binding the list together. Workflow:
library(data.table)

files <- list.files( whatever... )

# read the files from the list
l <- lapply(files, fread)

# name the list using the basename of each file
# this is also the step where you can manipulate the filenames however you like
names(l) <- basename(files)

# bind the rows from the list together, putting the filenames into the column "id"
dt <- rbindlist(l, idcol = "id")
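From there, the site substring the question asks about can be derived from the id column in place; a base-R sketch (assuming every file name actually contains a site code, otherwise the assignment lengths won't match):

# extract the "XX-abc" site code from the file name stored in "id"
dt[, site := regmatches(id, regexpr("[A-Z]{2}-[A-Za-z0-9]{3}", id))]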
You just need to write your own function that reads the csv and adds the column you want, before combining them.
library(readr)
library(stringr)
library(dplyr)

my_read_csv <- function(x) {
  out <- read_csv(x)
  site <- str_extract(x, "[A-Z]{2}-[A-Za-z0-9]{3}")
  cbind(Site = site, out)
}

filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
tbl <- lapply(filenames, my_read_csv) %>% bind_rows()
You can build a file_names vector based on sites with the exact same length as tbl, and then combine the two using cbind:
### Get file names
filenames <- list.files(path, full.names = TRUE, pattern = fileptrn, recursive = TRUE)
sites <- str_extract(filenames, "[A-Z]{2}-[A-Za-z0-9]{3}")

### Get length of each csv (note: this reads every csv twice)
file_lengths <- unlist(lapply(lapply(filenames, read_csv), nrow))

### Repeat sites using lengths
file_names <- rep(sites, file_lengths)

### Create table
tbl <- lapply(filenames, read_csv) %>%
  bind_rows()

### Combine file_names and tbl
tbl <- cbind(tbl, filename = file_names)
I'm trying to merge multiple Excel files into a single data.frame in R; all files are pulled from a common folder, taking only the 2nd sheet, which will always have a specific name ('Value Assessment').
In addition, I'd like to retain each file name in a column, so the source of the merged data is maintained.
I've been able to load the files and merge them into one data.frame, but can't figure out how to retain the file name as 'source name'.
setwd(".")
file.list <- list.files(pattern = '*.xlsx')
df.list <- lapply(file.list, read_excel)
df <- rbindlist(df.list, idcol = "id")
Using setNames():
library(readxl)
library(data.table)

file.list <- list.files(pattern = '*.xlsx')
file.list <- setNames(file.list, file.list)
df.list <- lapply(file.list, read_excel, sheet = 2)
df.list <- Map(function(df, name) {
  df$source_name <- name
  df
}, df.list, names(df.list))
df <- rbindlist(df.list, idcol = "id")
(Note: probably a typo, you were missing sheet = 2).
Try this to merge all data from all Excel files:
library(xlsx)

setwd("C:/Users/your_path_here/excel_files")
data.files <- list.files(pattern = "*.xlsx")

# read sheet 2 of every workbook, then stack the results
df.list <- lapply(data.files, function(x) read.xlsx(x, sheetIndex = 2))
data <- do.call(rbind, df.list)