applying a function to different sheets in different workbooks in R - r

I am new to R and am currently trying to apply a function to worksheets of the same index in different workbooks using the package XLConnect.
So far I have managed to read in a folder of excel files:
filenames <- list.files( "file path here", pattern="\\.xls$", full.names=TRUE)
and I have looped through the different files, reading each worksheet of each file
for (i in 1:length(filenames)){
tmp<-loadWorkbook(file.path(filenames[i],sep=""))
lst<- readWorksheet(tmp,
sheet = getSheets(tmp), startRow=5, startCol=1, header=TRUE)}
What I think I want to do is to loop through the files in filenames and then take the worksheets with the same index (eg. the 1st worksheet of the all the files, then the 2nd worksheet of all the files etc.) and save these to a new workbook (the first workbook containing all the 1st worksheets, then a second workbook with all the 2nd worksheets etc.), with a new sheet for each original sheet that was taken from the previous files and then use
for (sheet in newlst){
Count<-t(table(sheet$County))}
and apply my function to the parameter Count.
Does anyone know how I can do this or offer me any guidance at all? Sorry if it is not clear, please ask and I will try to explain further! Thanks :)

If I understand your question correctly, the following should solve your problem:
require(XLConnect)
# Example: workbooks w1.xls - w[n].xls each with sheets S1 - S[m]
filenames = list.files("file path here", pattern = "\\.xls$", full.names = TRUE)
# Read data from all workbooks and all worksheets
data = lapply(filenames, function(f) {
wb = loadWorkbook(f)
readWorksheet(wb, sheet = getSheets(wb)) # read all sheets in one go
})
# Assumption for this example: all workbooks contain same number of sheets
nWb = sapply(data, length)
stopifnot(diff(range(nWb)) == 0)
nWb = min(nWb)
for(i in seq(length.out = nWb)) {
# List of data.frames of all i'th sheets
dout = lapply(data, "[[", i)
# Note: write all collected sheets in one go ...
writeWorksheetToFile(file = paste0("wout", i, ".xls"), data = dout,
sheet = paste0("Sout", seq(length.out = length(data))))
}

using gdata/xlsx might be for the best. Even though it's the slowest out there, it's one of the more intuitive ones.
This method fails if the excels are sorted differently.
Given there's no real example, here's some food for thought.
Gdata requires perl to be installed.
library(gdata)
filenames <- list.files( "file path here", pattern="\\.xls$", full.names=TRUE)
amountOfSheets <- #imput something
#This snippet gets the first sheet out of all the files and combining them
readXlsSheet <- function(whichSheet, files){
for(i in seq(files)){
piece <- read.xls(files[i], sheet=whichSheet)
#rbinding frames should work if the sheets are similar, use merge if not.
if(i == 1) complete <- piece else complete <- rbind(complete, piece)
}
complete
}
#Now looping through all the sheets using the previous method
animals <- lapply(seq(amountOfSheets), readXlsSheet, files=filenames)
#The output is a list where animals[[n]] has the all the nth sheets combined.
xlsx might only work on 32bit R, and has it's own fair share of issues.
library(xlsx)
filenames <- list.files(pattern="\\.xls$", full.names=TRUE)
amountOfSheets <- 2#imput something
#This snippet gets the first sheet out of all the files and combining them
readXlsSheet <- function(whichSheet, files){
for(i in seq(files)){
piece <- read.xlsx(files[i], sheetIndex=whichSheet)
#rbinding frames should work if the sheets are similar, use merge if not.
if(i == 1) complete <- piece else complete <- rbind(complete, piece)
}
complete
}
readXlsSheet(1, filenames)
#Now looping through all the sheets using the previous method
animals <- lapply(seq(amountOfSheets), readXlsSheet, files=filenames)

Related

R: Loop to Read Multiple Excel Files in List [duplicate]

This question already has answers here:
How to import multiple .csv files at once?
(15 answers)
Closed 2 years ago.
I have two lists, one with the excel file paths that I would like to read and another list with the file names that I would like to assign to each as a dataframe. Trying to create a loop using the below code but the loop only creates a single dataframe with name n. Any idea how to make this work?
files <- c("file1.xlsx","file2.xlsx")
names <- c('name1','name2')
for (f in files) {
for (n in names) {
n <- read_excel(path = f)
}
}
You are overwriting n on each iteration of the loop
Edit:
#Parfait commented that we shouldn't use assign if we can avoid it, and he is right (e.g. why-is-using-assign-bad)
This does not use assign and puts the data in a neat list:
files <- c("file1.xlsx","file2.xlsx")
names <- c('name1','name2')
result <- list()
for (i in seq_along(files)) {
result[names[i]] <- read_excel(path = files[i]))
}
Old and not recommended answer (only left here for transparency reasons):
We can use assign to use a character string as variable name:
files <- c("file1.xlsx","file2.xlsx")
names <- c('name1','name2')
for (i in seq_along(files)) {
assign(names[i], read_excel(path = files[i]))
}
An alternative is to loop through all Excel files in a folder, rather than a list. I'm assuming they exist in some kind of folder, somewhere.
# load names of excel files
files = list.files(path = "C:/your_path_here/", full.names = TRUE, pattern = ".xlsx")
# create function to read multiple sheets per excel file
read_excel_allsheets <- function(filename, tibble = FALSE) {
sheets <- readxl::excel_sheets(filename)
sapply(sheets, function(f) as.data.frame(readxl::read_excel(filename, sheet = f)),
simplify = FALSE)
}
# execute function for all excel files in "files"
all_data <- lapply(files, read_excel_allsheets)
Updated...

Assigning the filename and sheet name to (multiple) observations in R

I have written a function that, after giving the direction of the folder, takes all the excel files inside it and merges them into a data frame with some modest modifications.
Yet I have two small things I would like to add but struggle with:
Each file has a country code in the name, and I would like the function to create an additional column in the data frame, "Country", where each observation would be assigned such country code. name example: BGR_CO2_May12
Each file is composed of many sheets, with each sheet representing the year; these sheets are also called by these years. I would like the function to create another column, "Year", where each observation would be assigned the name of the sheet that it comes from.
Is there a neat way to do it? Possibly without modifying the current function?
multmerge_xls_TEST <- function(mypath2) {
library(dplyr)
library(readxl)
library(XLConnect)
library(XLConnectJars)
library(stringr)
# This function gets the list of files in a given folder
re_file <- ".+\\.xls.?"
testFiles <- list.files(path = mypath2,
pattern = re_file,
full.names = TRUE)
# This function rbinds in a single dataframe the content of multiple sheets in the same workbook
# (assuming that all the sheets have the same column types)
# It also removes the first sheet (no data there)
rbindAllSheets <- function(file) {
wb <- loadWorkbook(file)
removeSheet(wb, sheet = 1)
sheets <- getSheets(wb)
do.call(rbind,
lapply(sheets, function(sheet) {
readWorksheet(wb, sheet)
})
)
}
# Getting a single dataframe for all the Excel files and cutting out the unnecessary variables
result <- do.call(rbind, lapply(testFiles, rbindAllSheets))
result <- result[,c(1,2,31)]
Try making a wrapper around readWorksheet(). This would store the file name into the variable Country and the sheet name into Year. You would need to do some regex on the file though to get the code only.
You could also skip the wrapper and simply add the mutate() line within your current function. Note this uses the dplyr package, which you already have referenced.
read_worksheet <- function(sheet, wb, file) {
readWorksheet(wb, sheet) %>%
mutate(Country = file,
Year = sheet)
}
So then you could do something like this within the function you already have.
rbindAllSheets <- function(file) {
wb <- loadWorkbook(file)
removeSheet(wb, sheet = 1)
sheets <- getSheets(wb)
do.call(rbind,
lapply(sheets, read_worksheet, wb = wb, file = file)
)
}
As another note, bind_rows() is another dplyr function which can take the place of your do.call(rbind, ...) calls.
bind_rows(lapply(sheets, read_worksheet, wb = wb, file = file))

How to 'read.csv' many files in a folder using R?

How can I read many CSV files and make each of them into data tables?
I have files of 'A1.csv' 'A2.csv' 'A3.csv'...... in Folder 'A'
So I tried this.
link <- c("C:/A")
filename<-list.files(link)
listA <- c()
for(x in filename) {
temp <- read.csv(paste0(link , x), header=FALSE)
listA <- list(unlist(listA, recursive=FALSE), temp)
}
And it doesn't work well. How can I do this job?
Write a regex to match the filenames
reg_expression <- "A[0-9]+"
files <- grep(reg_expression, list.files(directory), value = TRUE)
and then run the same loop but use assign to dynamically name the dataframes if you want
for(file in files){
assign(paste0(file, "_df"),read.csv(file))
}
But in general introducing unknown variables into the scope is bad practice so it might be best to do a loop like
dfs <- list()
for(index in 1:length(files)){
file <- files[index]
dfs[index] <- read.csv(file)
}
Unless each file is a completely different structure (i.e., different columns ... the number of rows does not matter), you can consider a more efficient approach of reading the files in using lapply and storing them in a list. One of the benefits is that whatever you do to one frame can be immediately done to all of them very easily using lapply.
files <- list.files(link, full.names = TRUE, pattern = "csv$")
list_of_frames <- lapply(files, read.csv)
# optional
names(list_of_frames) <- files # or basename(files), if filenames are unique
Something like sapply(list_of_frames, nrow) will tell you how many rows are in each frame. If you have something more complex,
new_list_of_frames <- lapply(list_of_frames, function(x) {
# do something with 'x', a single frame
})
The most immediate problem is that when pasting your file path together, you need a path separator. When composing file paths, it's best to use the function file.path as it will attempt to determine what the path separator is for operating system the code is running on. So you want to use:
read.csv(files.path(link , x), header=FALSE)
Better yet, just have the full path returned when listing out the files (and can filter for .csv):
filename <- list.files(link, full.names = TRUE, pattern = "csv$")
Combining with the idea to use assign to dynamically create the variables:
link <- c("C:/A")
files <-list.files(link, full.names = TRUE, pattern = "csv$")
for(file in files){
assign(paste0(basename(file), "_df"), read.csv(file))
}

Looping read.xlsx for reading in multiple sheets and saving them as separate DF´s

I want to read in sheets 3:8 of my excel file and save them separately.
I got something like this:
for (y in 2012:2017){
save("Year" ,y)<- for (i in 3:8)
{
read_xlsx("/Users/.../Desktop/Kriminalität.xlsx", sheet = i , skip = 4)
}
I am guessing you want to save 5 sheets as separate dataframes and name them as Year2012,Year2013....Year2017.
Create an empty list and read the sheets as elements.Name these elements accordingly and then unlist to get separate dataframes
library(openxlsx)
x=list()
for(i in 3: 8){
x[[i]]=read.xlsx("check.xlsx",sheet = i,colNames = T)
}
names(x)=paste0("Year",c(2012:2017))
list2env(x,envir=.GlobalEnv)
setwd("your directory to excel ")
library(readxl)
data="Kriminalität.xlsx" # your excel name
n=8 #number of the sheets in your excel
for (i in 3:n){
y=paste("sheet",i,sep="")
assign(y, read_xlsx(data, sheet = i ,skip = 4))
}
in section for (i in 3:n) you could define your sheets to read, for example for (i in 3:8) means read sheet3 to sheet8
results for a excel with 3 sheets for (i in 1:3):

Read specific sheets from an excel file list

I need to read specific sheets from a list of excel files.
I have >500 excel files, ".xls" and ".xlsx".
Each file can have different sheets, but I just want to read each sheets containing a specific expresion, like pattern = "^Abc", and not all files have sheets with this pattern.
I've created a code to read one file, but when I try to translate to multiple files, allways returns an error.
# example with 3rd file
# 2 sheets have the pattern
list_excels <- list.files(path = "path_to_folder", pattern = ".xls*"
sheet_names <- excel_sheets(list_excels[[3]])
list_sheets <- lapply(excel_sheets(list_excels[[3]]), read_excel, path = list_excels[[3]])
names(list_sheets) <- sheet_names
do.call("rbind", list_sheets[grepl(pattern = "^Abc", sheet names)])
But when I try to code for read multiple excels files, I have an error or something in the loop that slows a lot the computation.
There are some examples
This is a loop that doesn't return an error, but takes 30 seconds at least for each element of the list, I've never waited to finishing .
for (i in seq_along(list_excels)) {
sheet_names <- excel_sheets(list_excels[[i]])
list_sheets <- lapply(excel_sheets(list_excels[[i]]), read_excel, path = list_excels[[i]])
names(list_sheets) <- sheet_names[i] list_sheets[grepl(pattern = "^Abc", sheet_names)]
}
In this loop is missing the final part, the merging sheets with this code
list_sheets[grepl(pattern = "^Abc", sheet_names)]
I've tried to sum the rows of each sheet and store it in an vector, but I think that the loop is broken when there is a sheet that doesn't have the pattern.
x <- c()
for(i in seq_along(list_excels)) {
x[i] <- nrow(do.call("rbind",
lapply(excel_sheets(list_excels[[i]]),
read_excel,
path = list_excels[[i]])[grepl(pattern = "^Abc",
excel_sheets(list_excels[[i]]))]))
Also with purrr library, trying to read all, the same result with first loop example.
list_test <- list()
for(i in seq_along(list_excels)) {
list_test[[i]] <- excel_sheets(list_excels[[i]]) %>%
set_names() %>%
map(read_excel, path = list_excels[[i]])
}
Last example, that works with one excel file, but not with multiple. Just reading named sheet.
# One file works
data.frame(readWorksheetFromFile(list_excels[[1]], sheet = "Abc"))
#Multiple file returns an error
for(i in seq_along(list_excels)) {
data.frame(readWorksheetFromFile(list_excels[[i]], sheet = "Abc"))
#Returns the following error
#Error: IllegalArgumentException (Java): Sheet index (-1) is out of range (0..1)
Some one could help me?

Resources