Reading Excel files into a single data frame with readxl in R

I have a bunch of excel files and I want to read them and merge them into a single data frame.
I have the following code:
library(readxl)
files <- list.files()
f <- list()
data_names <- gsub("[.]xls", "", files)
To read each Excel file into its own data frame:
for (i in 1:length(files)){
assign(data_names[i], read_excel(files[i], sheet = 1, skip = 6))
}
But if I try to save the result in a single variable, only the last file is kept:
for (i in 1:length(files)){
temp <- read_excel(files[i], sheet = 1, skip = 6)
}

I would do this using plyr:
library(readxl)
library(plyr)
files <- list.files(".", "\\.xls")
data <- ldply(files, read_excel, sheet = 1, skip = 6)
If you wanted to add a column with the file name, you could instead do:
data <- ldply(files, function(fil) {
  data.frame(File = fil, read_excel(fil, sheet = 1, skip = 6))
})

I would recommend using a list in R. assign can be quite confusing, and you can't easily determine the values with get.
It should look like this:
l <- list()
for (i in seq_along(files)) {
  l[[i]] <- read_excel(files[i], sheet = 1, skip = 6)
}
ltogether <- do.call("rbind",l)
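As a minimal, self-contained sketch of the list approach above: the example uses temporary CSV files and read.csv as a stand-in for the Excel files and read_excel. That substitution, the file names, and the columns are assumptions made only so the snippet runs without extra packages.

```r
# Hypothetical stand-in data: three small CSV files in a temporary folder.
# read.csv plays the role of read_excel here (assumption for illustration).
dir <- tempfile("list_demo")
dir.create(dir)
for (k in 1:3) {
  write.csv(data.frame(id = k, value = c(10, 20) * k),
            file.path(dir, paste0("file", k, ".csv")), row.names = FALSE)
}

files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)

# Pre-allocate the list, fill one element per file, then bind the rows.
l <- vector("list", length(files))
for (i in seq_along(files)) {
  l[[i]] <- read.csv(files[i])
}
ltogether <- do.call(rbind, l)

nrow(ltogether)  # total number of rows across all files
```

Keeping the pieces in a list means nothing is overwritten inside the loop, and the final rbind happens once instead of once per file.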

Related

Reading several large files in a loop

I am trying to read several large files in a loop. So instead of doing:
library(fst)
df1 <-read_fst("C:/data1.fst", c(1:2), from = 1, to = 1000)
df2 <-read_fst("C:/data2.fst", c(1:2), from = 1, to = 1000)
df3 <-read_fst("C:/data3.fst", c(1:2), from = 1, to = 1000)
I would like to do something like this:
for(i in 1:3){
df_i <- read_fst("C:/data_i.fst", c(1:2), from = 1, to = 1000)
}
You can use list.files to generate all .fst files in a given dir and then loop through them:
files <- list.files(pattern = "\\.fst$") # .fst files in your current directory
df_list <- vector("list", length(files)) # Init list of DFs
for (i in seq_along(files))
  df_list[[i]] <- fst::read_fst(files[i], ...)
You can refine the pattern argument of list.files to match specific names, e.g. pattern = "data_\\d+\\.fst" to match files named like data_1.fst.
You can also specify the directory to search via the path argument and return the full file paths via full.names = TRUE.
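To illustrate how pattern, path, and full.names interact, here is a small sketch. It uses empty placeholder files in a temporary folder; the file names are hypothetical and chosen only to show which ones the regular expression keeps.

```r
# Hypothetical file names created in a temporary folder for illustration.
dir <- tempfile("fst_demo")
dir.create(dir)
file.create(file.path(dir, c("data_1.fst", "data_2.fst", "data_x.fst", "notes.txt")))

# "\\." matches a literal dot, "\\d+" one or more digits,
# and "^" / "$" anchor the match to the whole file name.
files <- list.files(path = dir, pattern = "^data_\\d+\\.fst$", full.names = TRUE)

basename(files)  # only the names of the form data_<number>.fst remain
```

With full.names = TRUE the returned paths can be passed straight to a reader function, regardless of the current working directory.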
It is better to store the loop output in a list. You can create a vector holding the paths where the files are stored (I called it myvec; you can change 1:3 to 1:n, where n is the number of files). With that done, all the results from the loop will be in List. Here is the code:
library(fst)
#Create empty list
List <- list()
#Vector
myvec <- paste0("C:/data",1:3,".fst")
#Loop
for(i in 1:length(myvec))
{
List[[i]] <- read_fst(myvec[i], c(1:2), from = 1, to = 1000)
}

exporting a list from R to excel in a good format

I'm trying to use the library(xlsx) to write some data from R into excel in a readable format.
My dataset is formatted as:
tbl <- list("some_name"=head(mtcars),"some_name2"=head(iris))
I would like to write this table to excel, with each item in the list being identified and the data being next to the item. E.g. the excel file should look like
"some_name" in cell A1
paste the dataframe head(mtcars) in cell A3
"some_name2" in cell A11
paste the dataframe head(iris) in cell A13
or something similar, e.g. pasting each item into a new worksheet.
Using
write.xlsx(tbl,"output.xlsx")
will output it correctly however it is not formatted in a readable way.
Any help would be great
The following code creates an xlsx file with multiple sheets. Each sheet uses a list name as both the sheet name and a title, with the data frame below the title. You can modify it as you like.
require(xlsx)
ls2xlsx <- function(x, wb){
for(i in 1:length(x)){
sh <- createSheet(wb, names(x[i]))
cl_title <- createCell(createRow(sh, 1), 1)
addDataFrame(x[i], sh, startRow = 2, startColumn = 1)
setCellValue(cl_title[[1, 1]], names(x[i]))
}
}
tbl <- list("some_name" = head(mtcars),"some_name2"=head(iris))
wb <- createWorkbook()
ls2xlsx(tbl, wb)
saveWorkbook(wb, 'test.xlsx')
The following function writes a list of data frames to an .xlsx file.
It has two modes, selected by the argument beside.
beside = TRUE is the default. It writes a single sheet with the data frame name in the first row, then an empty cell, then the data frame, and repeats this for all data frames, placing them side by side.
beside = FALSE writes one data frame per sheet. The sheet names are the data frames' names. If a list member has no name, its sheet name is built from the argument sheetNamePrefix.
The .xlsx file is written to the path given by the argument file.
writeList_xlsx <- function(x, file, beside = TRUE, sheetNamePrefix = "Sheet"){
xnames <- names(x)
shNames <- paste0(sheetNamePrefix, seq_along(x))
if(is.null(xnames)) xnames <- shNames
if(any(xnames == "")){
xnames[xnames == ""] <- shNames[xnames == ""]
}
wb <- createWorkbook(type = "xlsx")
if(beside){
sheet <- createSheet(wb, sheetName = shNames[1])
row <- createRow(sheet, rowIndex = 1)
col <- 0
for(i in seq_along(x)){
col <- col + 1
cell <- createCell(row, colIndex = col)
setCellValue(cell[[1, 1]], xnames[i])
col <- col + 2
addDataFrame(x[[i]], sheet,
startRow = 1, startColumn = col,
row.names = FALSE)
col <- col + ncol(x[[i]])
}
}else{
for(i in seq_along(x)){
sheet <- createSheet(wb, sheetName = xnames[i])
addDataFrame(x[[i]], sheet, row.names = FALSE)
}
}
if(!grepl("\\.xls", file)) file <- paste0(file, ".xlsx")
saveWorkbook(wb, file = file)
}
writeList_xlsx(tbl, file = "test.xlsx")
writeList_xlsx(tbl, file = "test2.xlsx", beside = FALSE)

Error when merging Excel files in R with a blank sheet

I'm using the following code to merge several excel files with multiple sheets. I get an error when it runs across a sheet that has the same header as the other files but is not populated with data. This is the error:
Error in data.frame(sub.id, condition, s.frame, ss) :
arguments imply differing number of rows: 0, 2
How can I avoid the error? Here is the code I am using below.
file.names <- list.files(pattern='*.xls')
sheet.names <- getSheets(loadWorkbook('File.xls'))
sheet.names <-sheet.names[1:12]
e.names <- paste0(rep('v', 16), c(1:16))
data.1 <- data.frame(matrix(rep(NA,length(e.names)),
ncol = length(e.names)))
names(data.1) <- e.names
for (i in 1:length(file.names)) {
wb <- loadWorkbook(file.names[i])
for (j in 1:length(sheet.names)) {
ss <- readWorksheet(wb, sheet.names[j], startCol = 2, header = TRUE)
condition <- rep(sheet.names[j], nrow(ss))
sub.id <- rep(file.names[i], nrow(ss))
s.frame <- seq(1:nrow(ss))
df.1 <- data.frame(sub.id, condition, s.frame, ss)
names(df.1) <- e.names
data.1 <- rbind(data.1, df.1)
rm(ss, condition, s.frame, sub.id, df.1)
}
rm(wb)
}
I suppose this solution will work for you. It loads all .xlsx files in a specified folder into a list of lists. Sheet-names and -headers shouldn't be an issue.
library(openxlsx)
# Define folder where your files are
path_folder <- "C:/path_to_files/"
# load file names into a list
f <- list.files(path_folder)
f <- ifelse(substring(f,nchar(f)-4,nchar(f))==".xlsx",f,NA)
f <- f[!is.na(f)]
data_list <- as.list(f)
# get sheet-names
names(data_list) <- data_list
data_list <- lapply(data_list, function(x){getSheetNames(paste0(path_folder, x))})
# load data into a list of lists
data_list <- lapply(data_list, function(x){as.list(x)})
data_list <- lapply(names(data_list),function(x){
sapply(data_list[[x]],function(y){read.xlsx(paste0(path_folder, x),sheet=y)})
})
# name the list elements
names(data_list) <- gsub(".xlsx", "", f)
You end up with a list (containing each file) of lists (containing the sheets of each file).
From here you can remove empty sheets, merge and edit them as you like.
I added an if statement that checks whether the sheet has more than one row and skips it otherwise, which resolved the error.
for (i in 1:length(file.names)) {
wb <- loadWorkbook(file.names[i])
for (j in 1:length(sheet.names)) {
ss <- readWorksheet(wb, sheet.names[j], startCol = 2, header = TRUE)
if (nrow(ss) > 1)
{
condition <- rep(sheet.names[j], nrow(ss))
sub.id <- rep(file.names[i], nrow(ss))
s.frame <- seq(1:nrow(ss))
df.1 <- data.frame(sub.id, condition, s.frame, ss)
names(df.1) <- e.names
data.1 <- rbind(data.1, df.1)
rm(ss, condition, s.frame, sub.id, df.1)
}
}
rm(wb)
}
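A likely cause of the original error, judging from the message "differing number of rows: 0, 2", is the expression seq(1:nrow(ss)). The sketch below shows what it evaluates to for an empty sheet and how seq_len behaves instead. The column name v4 and the values "File.xls" and "Sheet1" are hypothetical placeholders.

```r
n <- 0                       # number of rows in an empty sheet

1:n                          # counts down from 1 to 0, so it has two elements
length(seq(1:n))             # two elements, which clashes with the 0-row columns
length(seq_len(n))           # zero elements, consistent with an empty sheet

# With seq_len, a zero-row data frame is built without error.
ss <- data.frame(v4 = numeric(0))          # hypothetical empty sheet
df.1 <- data.frame(sub.id    = rep("File.xls", nrow(ss)),
                   condition = rep("Sheet1", nrow(ss)),
                   s.frame   = seq_len(nrow(ss)),
                   ss)
nrow(df.1)
```

Under this reading, replacing seq(1:nrow(ss)) with seq_len(nrow(ss)) would be an alternative to the if statement, and it would also keep sheets that contain exactly one row.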

Extract data from text files using for loop

I have 40 text files with names :
[1] "2006-03-31.txt" "2006-06-30.txt" "2006-09-30.txt" "2006-12-31.txt" "2007-03-31.txt"
[6] "2007-06-30.txt" "2007-09-30.txt" "2007-12-31.txt" "2008-03-31.txt" etc...
I need to extract one specific value from each file. I know how to do it individually, but this takes a while:
m_value1 <- `2006-03-31.txt`$Marknadsvarde_tot[1]
m_value2 <- `2006-06-30.txt`$Marknadsvarde_tot[1]
m_value3 <- `2006-09-30.txt`$Marknadsvarde_tot[1]
m_value4 <- `2006-12-31.txt`$Marknadsvarde_tot[1]
Can someone help me with a for loop which would extract the data from a specific column and row through all the different text files please?
Assuming your files are all in the same folder, you can use list.files to get the names of all the files, then loop through them and get the value you need. So something like this?
m_value<-character() #or whatever the type of your variable is
filelist <- list.files(path = "...", all.files = TRUE, full.names = TRUE) # full paths so read.table can find the files
for (i in seq_along(filelist)) {
  df <- read.table(filelist[i], header = TRUE)
  m_value[i] <- df$Marknadsvarde_tot[1]
}
EDIT:
If you have already imported all the data, you can use get:
txt_files <- list.files(pattern = "*.txt")
for (i in txt_files) {
  x <- read.delim(i, header = TRUE)
  assign(i, x)
}
m_value<-character()
for(i in 1:length(txt_files)) {
m_value[i] <- get(txt_files[i])$Marknadsvarde_tot[1]
}
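As a possible alternative to growing m_value inside a loop, the same extraction can be sketched with vapply, which checks that each file yields exactly one numeric value. The two temporary files and their contents below are hypothetical stand-ins for the real text files, and tab-delimited input is an assumption.

```r
# Hypothetical stand-in files: two small tab-delimited tables.
dir <- tempfile("txt_demo")
dir.create(dir)
write.table(data.frame(Marknadsvarde_tot = c(100.5, 110.5)),
            file.path(dir, "2006-03-31.txt"), sep = "\t", row.names = FALSE)
write.table(data.frame(Marknadsvarde_tot = c(200.5, 210.5)),
            file.path(dir, "2006-06-30.txt"), sep = "\t", row.names = FALSE)

txt_files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)

# vapply returns a numeric vector with exactly one value per file.
m_value <- vapply(txt_files, function(f) {
  read.delim(f, header = TRUE)$Marknadsvarde_tot[1]
}, numeric(1))
names(m_value) <- basename(txt_files)
m_value
```

Naming the result by file makes it easy to see which date each value belongs to.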
You could use the select parameter of fread from the data.table package for this:
library(data.table)
file.list <- list.files(pattern = '.txt')
# header = TRUE is needed so the column can be selected by name; nrows = 1 reads only the first data row
lapply(file.list, fread, select = 'Marknadsvarde_tot', nrows = 1, header = TRUE)
This will result in a list of data.tables. If you just want a vector with all the values:
sapply(file.list, function(x) fread(x, select = 'Marknadsvarde_tot', nrows = 1, header = TRUE)[[1]])
temp = list.files(pattern="*.txt")
library(data.table)
list2env(
lapply(setNames(temp, make.names(gsub("*.txt$", "", temp))),
fread), envir = .GlobalEnv)
This adds data.table to an existing answer at "Importing multiple .csv files into R".
After reading all your files, you can extract data from the data.tables using DT[i, j, by], where i is your row condition.

Merge multiple excel files into R taking only 2nd sheet, retaining file name as 'data source'

I'm trying to merge multiple excel files into a single data.frame in R - all files are pulled from a common folder, pulling only the 2nd sheet, which will always have a specific name ('Value Assessment').
In addition, I'd like to retain each file name in a column, so the source of the merged data is preserved.
I've been able to load the files and merge into one data.frame, but can't figure out how to retain file name as 'source name'.
setwd(/.)
file.list <- list.files(pattern='*.xlsx')
df.list <- lapply(file.list,read_excel)
df <- rbindlist(df.list, idcol = "id")
Using setNames():
library(readxl)     # read_excel
library(data.table) # rbindlist
file.list <- list.files(pattern = '*.xlsx')
file.list <- setNames(file.list, file.list)
df.list <- lapply(file.list, read_excel, sheet = 2)
df.list <- Map(function(df, name) {
df$source_name <- name
df
}, df.list, names(df.list))
df <- rbindlist(df.list, idcol = "id")
(Note: probably a typo, you were missing sheet = 2).
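To show the Map step in isolation, here is a minimal base R sketch. The two in-memory data frames and their names are hypothetical stand-ins for the sheets read from each workbook, and do.call(rbind, ...) is used in place of rbindlist so the snippet runs without extra packages.

```r
# Hypothetical stand-ins for the data frames read from each workbook.
df.list <- list(
  "a.xlsx" = data.frame(item = c("x", "y"), value = c(1, 2)),
  "b.xlsx" = data.frame(item = "z", value = 3)
)

# Add the list name (the file name) to each data frame as a column.
df.list <- Map(function(df, name) {
  df$source_name <- name
  df
}, df.list, names(df.list))

# Base R alternative to rbindlist (assumption: all frames share the same columns).
df <- do.call(rbind, unname(df.list))
df
```

Because the file name is stored in a regular column before binding, it survives any later merging or filtering.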
Try this to merge the data from the second sheet of all Excel files:
library(xlsx)
setwd("C:/Users/your_path_here/excel_files")
data.files <- list.files(pattern = "\\.xlsx$")
# Read the second sheet of every file into a list of data frames
data <- lapply(data.files, function(x) read.xlsx(x, sheetIndex = 2))
# Bind the list into one data frame (assumes all sheets share the same columns)
data <- do.call(rbind, data)
