I have a directory full of .xlsx files, each with multiple sheets. I want to extract the same sheet from every file and append them into a single tibble.
I have found numerous solutions for extracting multiple sheets from a single Excel file, but none for extracting a single sheet from multiple files.
I have tried:
paths = as.tibble(list.files("data/BAH", pattern = ".xlsx", full.names = TRUE, all.files = FALSE))
test <- paths %>% read_xlsx(sheet = "Portal", col_names = TRUE)
I know the "paths" variable contains all of my file names with path. However, I am not sure how to iterate through each file name appending just the specific sheet = "Portal" to a csv file.
The error is:
Error: path must be a string
I have tried passing in paths as a vector, as a tibble, and with subscripting as well. All of these fail.
So, in summary: I have a directory of xlsx files and I need to extract a single sheet from each one and append it to a csv file. I have tried using purrr with some of the map functions, but could not get that to work either.
My goal is to do this the tidyverse way.
Thanks for any hints.
You have to use lapply() or map() to iterate over the file paths; note that paths should be a plain character vector of paths (e.g. the output of list.files()), not a tibble. Try
test <- lapply(paths, read_xlsx, sheet = "Portal", col_names = TRUE)
or
library(purrr)
test <- map_dfr(paths, read_xlsx, sheet = "Portal", col_names = TRUE)
map_dfr() already row-binds its results; if you use the lapply() version, you can then bind the data frames with
library(dplyr)
test %>% bind_rows()
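Putting it all together, a minimal sketch of the whole workflow might look like the following (assuming the files live in data/BAH and each has a "Portal" sheet; the output name portal_combined.csv is just an example):
library(purrr)
library(readxl)
library(readr)
# full paths as a plain character vector
paths <- list.files("data/BAH", pattern = "\\.xlsx$", full.names = TRUE)
# read the "Portal" sheet from every file and row-bind into one tibble
portal_all <- map_dfr(paths, read_xlsx, sheet = "Portal", col_names = TRUE)
# write the combined result out as a single csv
write_csv(portal_all, "portal_combined.csv")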
library(tidyverse)
library(readxl)
library(fs)
# Get all files
xlsx_files <- fs::dir_ls("data/BAH", regexp = "\\.xlsx$")
portal_tabs <- map_dfr(xlsx_files, read_xlsx, sheet = "Portal", col_names = TRUE, .id = 'source')
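Note that fs::dir_ls() returns a named vector of paths, so .id = 'source' adds a source column recording which file each row came from; with an unnamed vector from list.files(), .id would only record element positions.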
Consider a folder 'C:/ZFILE' that contains many zip files.
Now, consider that each of these zips contains many csv files, among which is one specific csv named 'NAME.CSV'; all these scattered 'NAME.CSV' files are named and structured the same way (i.e., same columns).
How can I rbind all these scattered csv files?
The script below does the job, but a function would be more appropriate.
How could this be done?
Thanks
zfile <- "C:/ZFILE"
zlist <- list.files(path = zfile, pattern = "\\.zip$", recursive = FALSE, full.names = TRUE)
zlist # list all the zip files in the zfile folder
zunzip <- lapply(zlist, unzip, exdir = zfile) # unzip every zip into the zfile folder (may take time depending on the number of zips)
library(data.table) # rbindlist & fread
csv_name <- "NAME.CSV"
csv_list <- list.files(path = zfile, pattern = paste0("\\", csv_name, "$"), recursive = TRUE, ignore.case = FALSE, full.names = TRUE)
csv_list # list all the 'NAME.CSV' files found under the zfile folder
csv_rbind <- rbindlist(sapply(csv_list, fread, simplify = FALSE), idcol = 'filename')
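Since a function is wanted, the same steps can be wrapped up as below; this is only a sketch, with zdir and csv_name as illustrative argument names, and it assumes data.table is loaded as above:
read_zipped_csv <- function(zdir, csv_name = "NAME.CSV") {
  zlist <- list.files(path = zdir, pattern = "\\.zip$", full.names = TRUE)
  lapply(zlist, unzip, exdir = zdir)  # extract every zip into zdir
  csv_list <- list.files(path = zdir, pattern = paste0(csv_name, "$"),
                         recursive = TRUE, full.names = TRUE)
  # row-bind all the extracted csv files, keeping the source file name
  rbindlist(sapply(csv_list, fread, simplify = FALSE), idcol = "filename")
}
# usage: csv_rbind <- read_zipped_csv("C:/ZFILE")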
You can try this type of function (you can pass the unzip call directly to the cmd param of data.table::fread()):
get_zipped_csv <- function(path) {
  fnames <- list.files(path, full.names = TRUE)
  rbindlist(lapply(fnames, \(f) fread(cmd = paste0("unzip -p ", f))[, src := f]))
}
Usage:
get_zipped_csv(path = "C:/ZFILE")
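Note that this relies on a command-line unzip program being available on the system, since fread(cmd = ...) shells out and reads the command's output; the upside is that nothing has to be extracted to disk first.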
I have a script that merges all csv files in a folder.
My problem is that a new column named "...20" is created with empty data. How can I avoid that?
Thanks for helping.
My script:
folderfiles <- list.files(path = "//myserver/Depots/",
pattern = "\\.csv$",
full.names = TRUE)
data_csv <- folderfiles %>%
  set_names() %>%
  map_dfr(.f = read_delim, delim = ";")
and the message:
It's difficult to debug this without access to the specific files. However, you can try specifying the columns you want to read using the cols_only() function. For example, let's assume that you only want to read the mpg column. You can do that in the following manner:
library("fs")
library("readr")
library("tidyverse")
# Generating some sample files
temp_dir_files <- path_temp("cars")
dir_create(temp_dir_files)
for (i in 1:10) {
  write_csv(mtcars, file = path(temp_dir_files, paste0("cars", i, ".csv")))
}
# Selected column import
# read_* can handle a vector of paths
read_csv(
file = dir_ls(temp_dir_files, glob = "*.csv"),
col_types = cols_only(
mpg = col_double()
),
id = "input_file"
)
The cols_only() specification passed to read_csv() forces read_csv() to skip the remaining columns and import only the columns with matching names.
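If the stray ...20 column comes from something like a trailing ";" at the end of each line, another option is to keep the original import and simply drop the auto-named columns afterwards; a small sketch:
library(dplyr)
data_csv <- data_csv %>%
  select(-starts_with("..."))  # drop the columns readr had to auto-name, e.g. "...20"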
I have multiple excel files, with multiple sheets. I need to extract certain data from each sheet and combine all the data together. For a single file I do the following:
supdata = read_excel("Data/Exercise/IDNo-03.xlsx", sheet="Supervised", skip = 2)
ID = read_excel("Data/Exercise/IDNo-03.xlsx", sheet="Measurements", col_names = FALSE)
id = as.character(ID[1, 1]) %>%
  str_replace("Participant ", "")
mass = as.numeric(ID[3, 5])
supdata = supdata %>%
  mutate(ID = id, Mass = mass)
This works. I need to do this for all the files.
I've tried this:
dir_path <- "Data/Exercise/"
list = list.files(path = dir_path, "*.xlsx")
all = lapply(list, function(x) {
  supdata = read_excel(x, sheet = "Supervised", skip = 2)
  ID = read_excel(x, sheet = "Measurements", col_names = FALSE)
  id = as.character(ID[1, 1]) %>%
    str_replace("Participant ", "")
  mass = as.numeric(ID[3, 5])
  supdata = supdata %>%
    mutate(ID = id, Mass = mass)
})
list identifies the relevant files in the specified path, but I get an error:
Error: `path` does not exist: ‘IDNo-03.xlsx’
What am I doing wrong? Is there another way to approach this problem?
If I can get this bit working I will then do:
dat = do.call("rbind.data.frame", all)
list.files without full.names = TRUE returns only the file names, without the full path:
list.files(file.path(getwd(), "Downloads"), pattern ="\\.csv")
#[1] "testing.csv"
If we specify full.names = TRUE:
list.files(file.path(getwd(), "Downloads"), pattern ="\\.csv", full.names = TRUE)
#[1]"/Users/akrun/Downloads/testing.csv"
When we loop over those files without the path, R looks for each file in the working directory and therefore throws the error.
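Applied to the code in the question, a minimal sketch of the fix; the only substantive change is full.names = TRUE (the vector is renamed to files to avoid masking base::list):
files = list.files(path = dir_path, pattern = "\\.xlsx$", full.names = TRUE)
all = lapply(files, function(x) {
  supdata = read_excel(x, sheet = "Supervised", skip = 2)
  ID = read_excel(x, sheet = "Measurements", col_names = FALSE)
  supdata %>%
    mutate(ID = str_replace(as.character(ID[1, 1]), "Participant ", ""),
           Mass = as.numeric(ID[3, 5]))
})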
I have multiple .xlsx files that I would like to read and combine into one file. However, each of the files contains two sheets, so I would like to end up with one file that is all of sheet 1 and another that is all of sheet 2.
I have used code like this before to read multiple files, but it doesn't take the different sheets into account.
files = list.files(path = "../input_data/",
pattern = "*.xlsx",
full.names = T)
combined_data = sapply(files, read_excel, simplify = F) %>%
rbind.fill()
I had tried adding the sheet parameter to the read_excel function but that didn't work. Any ideas? Thanks!
I recommend using the openxlsx package for this:
df1 <- purrr::map_dfr(
files,
function(x) {
openxlsx::read.xlsx(x, sheet = 1)
}
)
df2 <- purrr::map_dfr(
files,
function(x) {
openxlsx::read.xlsx(x, sheet = 2)
}
)
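If you also want a column recording which file each row came from, purrr's .id argument works once the paths are named; a small sketch for the sheet 1 case:
df1 <- files %>%
  purrr::set_names() %>%   # name each element with its own path
  purrr::map_dfr(~ openxlsx::read.xlsx(.x, sheet = 1), .id = "source")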
The structure of my directory is as follows:
Extant_Data -> Data -> Raw -> course_enrollment
Extant_Data -> Data -> Raw -> frpm
I have a couple of different functions to read in some text files and excel files respectively.
read_fun = function(path){
test = read.delim(path, sep="\t", header=TRUE, fill = TRUE, colClasses = c(rep("character",23)))
test
}
read_fun_frpm= function(path){
test = read_excel(path, sheet = 2, col_names = frpm_names)
}
I feed this into map_dfr so that the function reads in each of the files and row-binds them.
allfiles = list.files(path = "Extant_Data/Data/Raw/course_enrollment",
pattern = "CourseEnrollment.txt",
full.names=FALSE,
recursive = T)
# Rowbind all the course enrollment data
# !!! BUT I HAVE set the working directory to a subdirectory so that it finds those files
setwd("/Extant_Data/Data/Raw/course_enrollment")
course_combined <- map_dfr(allfiles,read_fun)
allfiles = list.files(path = "Extant_Data/Data/Raw/frpm/post12",
pattern = "frpm*",
full.names=FALSE,
recursive = T)
# Rowbind all the frpm data
# !!! I have to change the directory AGAIN
setwd("Extant_Data/Data/Raw/frpm/post12")
frpm_combined <- map_dfr(allfiles,read_fun_frpm)
As mentioned in the comments, I have to keep changing the working directory so that map_dfr can locate the files. I don't think this is best practice; how might I work around it so I don't have to keep changing the directory? Any suggestions appreciated. Sorry it's hard to provide a reproducible example.
Note: This throws an error.
frpm_combined <- map_dfr(allfiles,read_fun_frpm('Extant_Data/Data/Raw/frpm/post12'))
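One way to avoid the setwd() calls is to have list.files() return full paths, so that map_dfr() hands each reader function a complete path; a minimal sketch built from the code above:
course_files <- list.files(path = "Extant_Data/Data/Raw/course_enrollment",
                           pattern = "CourseEnrollment.txt",
                           full.names = TRUE, recursive = TRUE)
course_combined <- map_dfr(course_files, read_fun)
frpm_files <- list.files(path = "Extant_Data/Data/Raw/frpm/post12",
                         pattern = "frpm",
                         full.names = TRUE, recursive = TRUE)
frpm_combined <- map_dfr(frpm_files, read_fun_frpm)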