Combining multiple sheets from multiple Excel files into one data frame in R

I want to take the data from specific cells spread over two worksheets in one workbook, write it into one line on a new "consolidated" spreadsheet, and then repeat for every workbook in a given folder.
I'm struggling with pulling the specific cells and writing them to one line.
Cells D1, D4 and D7 need pulling from sheet 1, along with the rectangle B4:F6 from sheet 2.
So far I can identify the correct folder and also pull the data I need, but only for one named file at a time.
What I am unable to do is use read_xlsx across multiple sheets across multiple workbooks at once.
Grateful for any advice.
Some of the code I am (unsuccessfully) using is below.
The following finds the files in the folder:
file.list <- list.files(path="FILE PATH", pattern="*.xlsx", full.names=TRUE, recursive=FALSE)
The following I can only get to work for one workbook at a time.
Ideally I could use the file.list described above. Also, I can only pull a rectangle, not three specific cells, without repeating the code three times (which is not a problem if that's my only solution).
Info <- read_xlsx("FILE PATH", sheet = 1, range = "G6:G12", col_names = FALSE,
                  col_types = "guess", na = "", trim_ws = TRUE, skip = 0,
                  # n_max = Inf, guess_max = min(1000, n_max),
                  progress = readxl_progress(), .name_repair = "unique")
Amount <- read_xlsx("FILE PATH", sheet = 2, range = "D4:G6", col_names = FALSE,
                    col_types = "numeric", na = "", trim_ws = TRUE, skip = 0,
                    # n_max = Inf, guess_max = min(1000, n_max),
                    progress = readxl_progress(), .name_repair = "unique")
and I'm having mixed success with lapply/sapply.
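One way to tie this together (a minimal sketch, assuming the readxl package and the cell locations described above; the helper read_one_file and its flattened column layout are made up for illustration):
library(readxl)

file.list <- list.files(path = "FILE PATH", pattern = "\\.xlsx$", full.names = TRUE)

# Hypothetical helper: pull the three single cells from sheet 1 and the
# B4:F6 rectangle from sheet 2, returning one single-row data frame per file
read_one_file <- function(f) {
  d1 <- read_xlsx(f, sheet = 1, range = "D1", col_names = FALSE)[[1]]
  d4 <- read_xlsx(f, sheet = 1, range = "D4", col_names = FALSE)[[1]]
  d7 <- read_xlsx(f, sheet = 1, range = "D7", col_names = FALSE)[[1]]
  grid <- read_xlsx(f, sheet = 2, range = "B4:F6", col_names = FALSE)
  # flatten the 3 x 5 rectangle into 15 extra columns on the same row
  data.frame(file = f, D1 = d1, D4 = d4, D7 = d7,
             t(as.vector(as.matrix(grid))))
}

# one row per workbook, stacked into a single consolidated data frame
consolidated <- do.call(rbind, lapply(file.list, read_one_file))
Keeping the single-cell reads inside one helper avoids copying the whole read_xlsx call three times in the main script, and the consolidated data frame can then be written out to the new spreadsheet in one go.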

Related

How to solve the problem of column names changing when write.xlsx() writes data into an Excel document in R?

I write a data.frame into an Excel document with the function write.xlsx(). The header of the data.frame contains characters like "95%CI", "Pr(>|W|)", etc. The data.frame prints in the R console without any problem, but when I write it to an Excel file with write.xlsx(), 95%CI becomes X95.CI and Pr(>|W|) becomes Pr...W..
How to solve this problem?
The test code is as follows:
library("openxlsx")
mydata <- data.frame("95%CI" = 1,
"Pr(>|W|)" =2)
write.xlsx(mydata,
"test.xlsx",
sheetName = "test",
overwrite = TRUE,
borders = "all", colWidths="auto")
I don't think this code gives the correct names in the R console either.
mydata <- data.frame("95%CI" = 1, "Pr(>|W|)" = 2)
mydata
#   X95.CI Pr...W..
# 1      1        2
You have some non-standard characters in the column names (like %, (, >, etc.). If you want to keep them, use check.names = FALSE in the data.frame() call.
mydata <- data.frame("95%CI" = 1, "Pr(>|W|)" = 2, check.names = FALSE)
mydata
#   95%CI Pr(>|W|)
# 1     1        2
Now when you write it to Excel:
openxlsx::write.xlsx(mydata,
                     "test.xlsx",
                     sheetName = "test",
                     overwrite = TRUE,
                     borders = "all", colWidths = "auto")

How to do a vlookup for a few million rows without running into memory problems?

I am looking to write code that can perform a vlookup.
I have several Excel files (each containing ~1m rows, the Excel file limit) that act as lookup tables.
I then have an Excel sheet with two columns that I need to look up; these two columns can't be merged, as the results must be kept separate.
First I load the files that contain the tables I want to look up in:
All <- lapply(filenames_list, function(filename) {
  print(paste("Merging", filename, sep = " "))
  read.xlsx(filename)
})
df <- do.call(rbind.data.frame, All)
Then I load the files I want to look up:
LookUpID1 <- read.xlsx(paste(current_working_dir, "/LookUpIDs.xlsx", sep = ""),
                       sheet = 1, startRow = 1, colNames = TRUE, cols = 1,
                       skipEmptyRows = TRUE, skipEmptyCols = TRUE)
LookUpID2 <- read.xlsx(paste(current_working_dir, "/LookUpIDs.xlsx", sep = ""),
                       sheet = 1, startRow = 1, colNames = TRUE, cols = 2,
                       skipEmptyRows = TRUE, skipEmptyCols = TRUE)
I need to load the file twice, as I need to perform a lookup on both columns 1 and 2.
And then the actual vlookup:
# Matching ID
FoundIDs1 <- merge(df, LookUpID1)
FoundIDs2 <- merge(df, LookUpID2)
FoundIDs <- merge(FoundIDs1, FoundIDs2, by = NULL)
The issue is that my PC runs out of memory when running the last part of the code (the actual vlookup):
Error: cannot allocate vector of size 1715.0 Gb
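The allocation error most likely comes from the last line: merge(FoundIDs1, FoundIDs2, by = NULL) requests the Cartesian product of the two results, which is what exhausts memory. A minimal sketch of a keyed alternative (assuming the lookup key column is called ID in df and in both lookup files; that name is made up for illustration):
# keep only the rows of df whose key appears in each lookup column;
# %in% uses a hashed match, so no cross join is ever materialised
FoundIDs1 <- df[df$ID %in% LookUpID1$ID, ]
FoundIDs2 <- df[df$ID %in% LookUpID2$ID, ]
If both result sets are needed together, bind them by row or write each one out separately rather than cross-joining them.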

Issue with combining multiple separate data points in R

I have X number of spreadsheets with information spreading out over two tabs.
I am looking to combine these into one data frame.
The files have 3 distinct cells on tab 1 (D6, D9, D12), and tab 2 has a grid (D4:G6) that I want to pull out of each spreadsheet into a row.
So far I have made the data frame and pulled a list of the files. I have managed to get a for loop working that pulls out the data from sheet 1 cell D6; I plan to copy this code for the rest of the cells I need.
file.list <-
list.files(
path = "filepath",
pattern = "*.xlsx",
full.names = TRUE,
recursive = FALSE
)
colnames <- c("A", "B", "C", "etc")
output <- matrix(NA, nrow = length(file.list), ncol = length(colnames), byrow = FALSE)
colnames(output) <- c(colnames)
rownames(output) <- c(file.list)
for (i in 1:length(file.list)) {
  filename <- file.list[i]
  data <- read.xlsx(file = filename, sheetIndex = 1, colIndex = 7, rowIndex = 6)
  assign(x = filename, value = data)
}
The issue I have is that R then pulls out X single data points as separate objects, and I am unable to bring these together as one list of data points to insert into the data frame.
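One way around this (a minimal sketch, assuming the xlsx package used above; the read call is copied unchanged from the question's loop, and the column name "A" is just the placeholder from the colnames vector):
for (i in 1:length(file.list)) {
  filename <- file.list[i]
  data <- read.xlsx(file = filename, sheetIndex = 1, colIndex = 7, rowIndex = 6)
  # write the value into the pre-allocated matrix instead of creating a new
  # object with assign(); repeat with other indices to fill columns B, C, ...
  output[i, "A"] <- data[1, 1]
}
output <- as.data.frame(output)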

Issue downloading and opening xlsx-file from within R

I would like to download and open the following Excel-file with monthly and annual consumer price indices directly from within R.
https://www.bfs.admin.ch/bfsstatic/dam/assets/7066959/master
(the link can be found on this site: https://www.bfs.admin.ch/bfs/de/home/statistiken/preise/landesindex-konsumentenpreise/lik-resultate.assetdetail.7066959.html)
I used to download this file manually using the browser, save it locally on my computer, then open the xlsx-file with R and work with the data without any problems.
I have now tried to read the file directly from within R, but without luck so far. As you can see from the URL above, there is no .xlsx extension or the like, so I figured the file is zipped somehow. Here is what I've tried so far and where I am stuck.
library(foreign)
library(xlsx)
# in a browser, this link opens or downloads an xlsx file
likurl <- "https://www.bfs.admin.ch/bfsstatic/dam/assets/7066959/master"
temp <- tempfile()
download.file(likurl, temp)
list.files <- unzip(temp, list = TRUE)
data <- read.xlsx(unz(temp,
                      + list.files$Name[8]), sheetIndex = 2)
The result from the last step is
Error in +list.files$Name[8] : invalid argument to unary operator
I do not really understand the unz function, but reading its help file I can see this is somehow wrong (I found this suggested solution somewhere online).
I also tried the following, different approach:
library(XLConnect)
likurl <- "https://www.bfs.admin.ch/bfsstatic/dam/assets/7066959/master"
tmp = tempfile(fileext = ".xlsx")
download.file(likurl, tmp)
readWorksheetFromFile(tmp, sheet = 2, startRow = 4,
colNames = TRUE, rowNames = FALSE)
with the last line returning as result:
Error: ZipException (Java): invalid entry size (expected 1644 but got 1668 bytes)
I would greatly appreciate any help on how I can open this data and work with it as usual when reading in data from excel into R.
Thanks a lot in advance!
Here's my solution, thanks to the hint by @Johnny. Reading the data from Excel worked better with read.xlsx from the xlsx package (instead of read_excel as suggested in the link above).
Some ugly details remain with how the columns are named (colNames is not passed on correctly, except for the first and 11th columns) and with the strange new columns created from the options passed to read.xlsx (e.g. a column named colNames with all entries == TRUE; for details, see the output structure with str(LIK.m)). However, those would be for another question, and for the moment they can be fixed in a quick and dirty way :-).
library(httr)
library(foreign)
library(xlsx)
# in a browser, this link opens or downloads an xlsx file
likurl <- 'https://www.bfs.admin.ch/bfsstatic/dam/assets/7066959/master'
p1f <- tempfile()
download.file(likurl, p1f, mode = "wb")
GET(likurl, write_disk(tf <- tempfile(fileext = ".xlsx")))
# annual CPI
LIK.y <- read.xlsx(tf,
                   sheetIndex = 2, startRow = 4,
                   colNames = TRUE, rowNames = FALSE, stringsAsFactors = FALSE,
                   detectDates = FALSE, skipEmptyRows = TRUE, skipEmptyCols = TRUE,
                   na.strings = "NA", check.names = TRUE, fillMergedCells = FALSE)
LIK.y$X. <- as.numeric(LIK.y$X.)
str(LIK.y)
# monthly CPI
LIK.m <- read.xlsx(tf,
                   sheetIndex = 1, startRow = 4,
                   colNames = TRUE, rowNames = FALSE, stringsAsFactors = FALSE,
                   detectDates = FALSE, skipEmptyRows = TRUE, skipEmptyCols = TRUE,
                   na.strings = "NA", check.names = TRUE, fillMergedCells = FALSE)
LIK.m$X. <- as.numeric(LIK.m$X.)
str(LIK.m)
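A likely explanation for the stray columns noted above (my reading, not something stated in the original post): xlsx::read.xlsx does not know openxlsx-style arguments such as colNames or skipEmptyRows, so they fall through its ... argument into data.frame(), where each one becomes an extra column (hence a column literally named colNames with all entries TRUE). A sketch using only arguments that xlsx::read.xlsx itself accepts, with header in place of colNames:
library(xlsx)
# read the monthly CPI sheet again, passing only arguments xlsx understands
LIK.m <- read.xlsx(tf, sheetIndex = 1, startRow = 4,
                   header = TRUE, stringsAsFactors = FALSE)
str(LIK.m)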

R - rollapply across a multiple-file database

I have a large database that I've split into multiple files. Each file is saved in the same directory, and there is a numerical sequence in the naming scheme so the order of the database is maintained. I've done this to reduce the time and memory it takes to load and manipulate the database. I would like to start analyzing the database in sequence, which I intend to accomplish using a rollapply-like function. I am having a problem when I want the window to span two files at once, which is where I need help. Here is a dummy dataset that will create five CSV files with a naming scheme similar to my database:
library(readr)
val <- c(1,2,3,4,5)
df_1 <- data.frame(val)
write_csv(df_1, "1_database.csv", col_names = TRUE)
write_csv(df_1, "2_database.csv", col_names = TRUE)
write_csv(df_1, "3_database.csv", col_names = TRUE)
write_csv(df_1, "4_database.csv", col_names = TRUE)
write_csv(df_1, "5_database.csv", col_names = TRUE)
Keep in mind that this database is huge, and causes memory and time issues on my current machine. The solution MUST have a component that "forgets". This means recurrently joining the files, or loading them all at once to the R environment is not an option. When a new file is loaded, the last file must be removed from the R environment. I can have at maximum three files loaded at once. For example files 1-3 can be loaded, and then file 1 needs to be removed before file 4 is loaded.
The output can be a single list of all files - the combination of files 1-5 in a single list.
For the sake of simplicity, let's say I want to use a window of 2, and I want to calculate the mean of this window. I'm imagining something like this (see below), but this may be a failed approach, and I'm open to anything.
appreciated_function <- function(x) {
  # your greatly appreciated function goes here
}
rollapply(df, 2, appreciated_function, by.column = FALSE, align = "left")
Suppose the window width is k. Iterate through all the files, and for each one read that file plus the first k-1 rows of the next (except for the last), then use rollapply on that, appending what we get to what we have so far. Alternatively, if the output is too large, we could write out each result instead of appending it.
At the bottom we check that it gives the expected result.
library(readr)
library(zoo)
val <- c(1,2,3,4,5)
df_1 <- data.frame(val)
write_csv(df_1, "1_database.csv", col_names = TRUE)
write_csv(df_1, "2_database.csv", col_names = TRUE)
write_csv(df_1, "3_database.csv", col_names = TRUE)
write_csv(df_1, "4_database.csv", col_names = TRUE)
write_csv(df_1, "5_database.csv", col_names = TRUE)
d <- dir(pattern = "database.csv$")
k <- 2
r <- NULL
for (i in seq_along(d)) {
  Next <- if (i != length(d)) read_csv(d[i + 1], n_max = k - 1)
  DF <- rbind(read_csv(d[i]), Next)
  r0 <- rollapply(DF, k, sum, align = "left")
  # if output too large replace next statement with one to write out r0
  r <- rbind(r, r0)
}
# check
r2 <- rollapply(data.frame(val = sequence(rep(5, 5))), k, sum, align = "left")
identical(r, r2)
## [1] TRUE
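For the case mentioned above where even the combined output is too large to keep in memory, the same loop can stream each chunk to disk instead of rbind-ing it (a minimal variant; the output file name rolling_sums.csv is made up):
library(readr)
library(zoo)
d <- dir(pattern = "database.csv$")
k <- 2
out_file <- "rolling_sums.csv"  # hypothetical output file
for (i in seq_along(d)) {
  Next <- if (i != length(d)) read_csv(d[i + 1], n_max = k - 1)
  DF <- rbind(read_csv(d[i]), Next)
  r0 <- rollapply(DF, k, sum, align = "left")
  # append each chunk immediately so only the current chunk stays in memory
  write_csv(as.data.frame(r0), out_file, append = (i > 1), col_names = (i == 1))
}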
