Load multiple Excel files and name each object after its file name - r

I have read several questions related to this, but none is quite what I am looking for.
The best one by far uses the readxl package:
library(readxl)
file.list <- list.files(pattern='*.xlsx')
df.list <- lapply(file.list, read_excel)
but, as explained there, it gives a list.
What I want is to load each file in the working directory as an object named after its file name.
What I do now is setwd() into the directory that holds all the xlsx files and then load them one by one by name, for example:
mydf1 <- read_excel("mydf1.xlsx")
mydfb <- read_excel("mydfb.xlsx")
datac <- read_excel("datac.xlsx")
Is there any other way to get them without repeating the name over and over again?

You can use assign() with a for loop:
library(readxl)
file.list <- list.files(pattern = "*.xlsx")
for (i in file.list) {
  assign(sub("\\.xlsx$", "", i), read_excel(i))
}
PS: you need sub() to remove the file extension (otherwise you would get an object named mydf1.xlsx instead of mydf1).

This is a perfect use case for the purrr package:
library(readxl)
library(tidyverse) #loads purrr
file.names <- list.files(pattern = "\\.xlsx$")
# for each Excel file name, read the sheet and append it to one data frame
df.excel <- file.names %>% map_df(~ read_excel(path = .x))

You could use something like this in your loop:
lapply(file.list, function(x) {
  df <- read_excel(x)
  # strip the file extension to build the object name
  y <- gsub("\\..*", "", x)
  assign(y, df, envir = globalenv())
})

You only think that you want each one loaded into the global environment. As you become more experienced with R you will find that in most (if not all) cases it is better to keep related objects like this together in a list.
If they are all in a list, you can use lapply or sapply to run the same command on every element, instead of writing a new loop where you get each object and process it (a short sketch follows below).
The list approach is also less likely to overwrite other objects that you want to keep, or to cause other action-at-a-distance bugs (which can be very hard to track down).
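A minimal sketch of that workflow (assuming the .xlsx files sit in the working directory):
library(readxl)
# read every file into one named list instead of one object per file
file.list <- list.files(pattern = "\\.xlsx$")
df.list <- lapply(file.list, read_excel)
names(df.list) <- tools::file_path_sans_ext(file.list)
# run the same operation on every data frame at once, e.g. count rows
sapply(df.list, nrow)
# and a single data frame is still easy to reach by name, e.g. df.list[["mydf1"]]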

Building on the purrr approach by @SethRaithel, this provides a column with the file names.
library(tidyverse)
library(readxl)
# create a list of files matching a regular expression
# in a defined path (temp_path points at the folder holding the files)
file_list <- fs::dir_ls(temp_path, regexp = "[.]xls")
data_new <- file_list %>%
  # convert to a tibble
  as_tibble() %>%
  # create a column with just the file name for reference
  mutate(file = fs::path_file(value)) %>%
  # use map to read all the files and then return a single df
  mutate(data = purrr::map(value, .f = readxl::read_excel)) %>%
  unnest(cols = data) %>%
  janitor::clean_names() %>%
  select(-value)

Related

How to read and rbind all .xlsx files in a folder efficiently using read_excel

I am new to R and need to create one dataframe from 80 .xlsx files that mostly share the same columns and are all in the same folder. I want to bind all these files efficiently, in a manner that would still work if I added or removed files from the folder later. I want to do this without converting the files to .csv, unless someone can show me how to do that efficiently for large numbers of files within R itself.
I've previously been reading files individually using the read_excel function from the readxl package and then combining them with rbind. This was fine for 10 files, but not for 80! I've experimented with many solutions offered online, but none of them seem to work, largely because they use functions other than read_excel or formats other than .xlsx. I haven't kept track of many of my failed attempts, so I cannot offer code other than one alternate method I tried to adapt to read_excel from a read_csv approach.
#Method 1
library(readxl)
library(purrr)
library(dplyr)
library(tidyverse)
file.list <- list.files(pattern='*.xlsx')
alldata <- file.list %>%
  map(read_excel) %>%
  reduce(rbind)
#Output
New names:
* `` -> ...2
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
Any code on how to do this would be greatly appreciated. Sorry if anything is wrong about this post, it is my first one.
UPDATE:
Using the changes suggested by the answers, I'm now using the code:
file.list <- list.files(pattern='*.xlsx')
alldata <- file.list %>%
  map_dfr(read_excel) %>%
  reduce(bind_rows)
This output now is as follows:
New names:
* `` -> ...2
Error: Column `10.Alert.alone` can't be converted from numeric to character
This happens regardless of which type of bind() function I use in the reduce() slot. If anyone can help with this, please let me know!
You're on the right track here. But you need to use map_dfr instead of plain-vanilla map. map_dfr outputs a data frame (or actually tibble) for each iteration, and combines them via bind_rows.
This should work:
library(readxl)
library(tidyverse)
file.list <- list.files(pattern='*.xlsx')
alldata <- file.list %>%
map_dfr(~read_excel(.x))
Note that this assumes your files all have consistent column names and data types. If they don't, you may have to do some cleaning. (One trick I've used in complex cases is to add a %>% mutate_all(as.character) to the read_excel command inside the map function. That will turn everything into characters, and then you can convert the data types from there.)
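A rough sketch of that trick (the type_convert() clean-up step is my own suggestion, not part of the original answer):
library(readxl)
library(tidyverse)
file.list <- list.files(pattern = "\\.xlsx$")
alldata <- file.list %>%
  # read each sheet with every column coerced to character so bind_rows never clashes
  map_dfr(~ read_excel(.x) %>% mutate_all(as.character)) %>%
  # then let readr re-guess sensible column types
  type_convert()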
This should get you there, or close...
library(data.table)
library(readxl)
#create files list
file.list <- list.files( pattern = ".*\\.xlsx$", full.names = TRUE )
#read files to a list of data.frames
l <- lapply(file.list, readxl::read_excel)
#bind l together into one larger data.table by column name, filling missing columns with NA
dt <- data.table::rbindlist(l, use.names = TRUE, fill = TRUE)
Try using map_dfr.
alldata <- file.list %>%
map_dfr(read_excel)

Fetch csv files and filter using R

Is there a faster way to fetch a bunch of csv files, merge them together (they all have the same structure), and keep only the rows whose Value (a column) is greater than 5?
The csv files will have thousands of rows each, while typically fewer than 100 rows per csv will have a Value greater than 5.
The working code I have is:
library(tidyverse)
filelocns <-"C:/Data/test/"
# get files list from folder
file.list <- list.files(path=filelocns, recursive=T,pattern='*.csv')
# row bind the listed CSVs and filter for Values >= 5
rows_gt5 <- lapply(paste0(filelocns, file.list), read.csv) %>%
  bind_rows() %>%
  filter(Value >= 5)
Try whether read_csv (from the readr package) is suitable for you, i.e. change the line
rows_gt5 <- lapply(paste0(filelocns,file.list),read.csv) %>%
to
rows_gt5 <- lapply(paste0(filelocns,file.list),read_csv) %>%
In general it is faster than read.csv.
Have a look at the docs for further details on how to use it.
Here's how I would approach this:
# source dependencies (plyr provides ldply, readr provides read_csv)
library(plyr)
library(dplyr)
library(readr)
# declare path to desired directory
filelocns <-"C:/Data/test/"
# list all of the files within this directory
file.list <- list.files(path = filelocns,
                        pattern = '\\.csv$',
                        all.files = FALSE,
                        full.names = TRUE,
                        ignore.case = FALSE)
# apply the read_csv function to our list of files
row_gt5 <- ldply(file.list, read_csv) %>%
  # and filter out values less than five
  filter(Value >= 5)
You can replace the read_csv function with a custom function wrapper to re-format raw data on the fly before storing it into a master dataframe.
It sounds like read_csv is all you need to get going though.
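For example, a minimal sketch of such a wrapper (the Value column and the >= 5 threshold come from the question; the function name is made up):
library(readr)
library(dplyr)
read_and_filter <- function(path) {
  # read one csv and drop unwanted rows before the files are bound together
  read_csv(path) %>%
    filter(Value >= 5)
}
rows_gt5 <- lapply(file.list, read_and_filter) %>%
  bind_rows()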

After reading multiple files into R, how can I set the resulting df's to the file name?

Currently I am using the below to read in ~7-10 files to the R console all at once.
library(magrittr)
library(feather)
list.files("C:/path/to/files",pattern="\\.feather$") %>% lapply(read_feather)
How can I pipe these into independent data objects based on their unique file names?
ex.
accounts_jan.feather
users_jan.feather
-> read feather function -> hold in working memory as:
accounts_jan_df
users_jan_df
Thanks.
This seems like a case of trying to do too much with a pipe (https://github.com/hrbrmstr/rstudioconf2017/blob/master/presentation/writing-readable-code-with-pipes.key.pdf). I would recommend segmenting your process a little:
# Get vector of files (full.names = TRUE so the paths work from any working directory)
files <- list.files("C:/path/to/files", pattern = "\\.feather$", full.names = TRUE)
# Form object names (file_path_sans_ext lives in the tools package)
object_names <-
  files %>%
  basename %>%
  tools::file_path_sans_ext()
# Read files and add them to the global environment
lapply(files, read_feather) %>%
  setNames(object_names) %>%
  list2env(envir = .GlobalEnv)
If you really must do this with a single pipe, you should use mapply instead, as it has a USE.NAMES argument.
list.files("C:/path/to/files", pattern = "\\feather$") %>%
mapply(read_feather,
.,
USE.NAMES = TRUE,
SIMPLIFY = FALSE) %>%
setNames(names(.) %>% basename %>% tools::file_path_sans_ext) %>%
list2env()
Personally, I find the first option easier to reason with when I go to do debugging (I'm not a fan of pipes within pipes).

How can I turn a part of the filename into a variable when reading multiple text files?

I have multiple text files (around 60) that I merge into a single file. I am looking for a way to add, for each file, a variable containing only the first 4 digits of the file name. An example of a file name is 1111_2222_3333.txt.
So basically I need an additional variable that holds the first 4 digits of the file name for each file.
I did find the following related topics, but they do not let me include only the first 4 digits:
How can I turn the filename into a variable when reading multiple csvs into R
R: Read multiple files and label them based on the file name
My current code, which does not yet include the file name, is:
files <- list.files("pathname", pattern="*.TXT")
masterfilesales <- do.call(rbind, lapply(files, read.table))
Update: Although the initial answer is correct, the same goal can be achieved in fewer steps by using sapply with simplify=FALSE instead of lapply because sapply automatically assigns the filenames to the elements in the list:
library(data.table)
files <- list.files("pathname", pattern="*.TXT")
file.list <- sapply(files, read.table, simplify=FALSE)
masterfilesales <- rbindlist(file.list, idcol="id")[, id := substr(id,1,4)]
Old answer: To achieve what you want, you can utilize a combination of the setattr function and the idcol parameter of the rbindlist function from the data.table package as follows:
library(data.table)
files <- list.files("pathname", pattern="*.TXT")
file.list <- lapply(files, read.table)
setattr(file.list, "names", files)
masterfilesales <- rbindlist(file.list, idcol="id")[, id := substr(id,1,4)]
Alternatively, you can set the filenames in base R with:
attr(file.list, "names") <- files
or:
names(file.list) <- files
and bind them together with bind_rows from the dplyr package (which has also an .id parameter to create an id-column):
masterfilesales <- bind_rows(file.list, .id="id") %>% mutate(id = substr(id,1,4))
Are you looking for something like this?
c("1111_444.txt", "443343iqueh.txt") -> a
substring(a, first=1, last=4)

Loop in R to read many files

I have been wondering if anybody knows a way to create a loop that loads files/databases in R.
Say I have some files like this: data1.csv, data2.csv, ..., data100.csv.
In some programming languages one can write something like data + { x } + .csv and the system recognizes it as datax.csv, and then you can apply the loop.
Any ideas?
Sys.glob() is another possibility - its sole purpose is globbing or wildcard expansion.
dataFiles <- lapply(Sys.glob("data*.csv"), read.csv)
That will read all the files of the form data[x].csv into list dataFiles, where [x] is nothing or anything.
[Note this is a different pattern to that in @Joshua's answer. There, list.files() takes a regular expression, whereas Sys.glob() just uses standard wildcards; which wildcards can be used is system dependent, and details can be found on the help page ?Sys.glob.]
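A quick sketch of the difference (assuming files named data1.csv ... data100.csv in the working directory):
glob_files <- Sys.glob("data*.csv")                    # shell-style wildcard
regex_files <- list.files(pattern = "^data.*\\.csv$")  # regular expression
identical(sort(glob_files), sort(regex_files))         # both should list the same files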
See ?list.files.
myFiles <- list.files(pattern="data.*csv")
Then you can loop over myFiles.
I would put all the CSV files in a directory, create a list and do a loop to read all the csv files from the directory in the list.
setwd("~/Documents/")
ldf <- list() # creates a list
listcsv <- dir(pattern = "*.csv") # creates the list of all the csv files in the directory
for (k in 1:length(listcsv)) {
  ldf[[k]] <- read.csv(listcsv[k])
}
str(ldf[[1]])
Read the headers in each file so that they can be used when building the merged file:
library(dplyr)
library(readr)
list_file <- list.files(pattern = "*.csv") %>%
  lapply(read.csv, stringsAsFactors = FALSE) %>%
  bind_rows
fi <- list.files(directory_path, full.names = TRUE)
dat <- lapply(fi,read.csv)
dat will contain the datasets in a list
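If you also want to keep track of which dataset came from which file, a small optional addition (naming the elements and the rbind step are assumptions, not part of the answer above):
names(dat) <- basename(fi)       # label each list element with its file name
all_data <- do.call(rbind, dat)  # stack them, if the files share the same columns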
Let's assume that your files have the file format that you mentioned in your question and that they are located in the working directory.
You can vectorise the creation of the file names if they have a simple naming structure, then apply a loading function to all the files (here I used the purrr package, but you can also use lapply):
library(purrr)
c(1:100) %>% paste0("data", ., ".csv") %>% map(read.csv)
Here's another solution using a for loop. I like it better than the others because of its flexibility and because all dfs are directly stored in the global environment.
Assume you've already set your working directory, the algorithm will iteratively read all files and store them in the global environment with the name "datai".
indices <- 1:100
for (i in indices) {
  filename <- paste0("data", i)
  filepath <- paste0("data", i, ".csv")
  assign(filename, read.csv(filepath))
}
First, set the working directory.
Find and store all the files ending with .csv.
Bind all of them row-wise.
Following is the code sample:
setwd("C:/yourpath")
temp <- list.files(pattern = "*.csv")
allData <- do.call("rbind", lapply(temp, read.csv))
This may be helpful if you have datasets for participants as in psychology/sports/medicine etc.
setwd("C:/yourpath")
temp <- list.files(pattern = "*.sav")
#Maybe you want to unselect /delete IDs
DEL <- grep('ID(04|08|11|13|19).sav', temp)
temp2 <- temp[-DEL]
#Make a list of that contains all data
read.all <- lapply(temp2, read_sav)
#View(read.all[1])
#Option 1: put one under the next
df <- do.call("rbind", read.all)
Option 2: compute something within each dataset (single IDs), e.g. get the mean of certain variables for each participant:
mw_extraktion <- function(data_raw) {
  data_raw <- data.frame(data_raw)
  # you may now calculate e.g. the mean of a certain variable for each ID
  ID <- data_raw$ID[1]
  data_OneID <- c(ID, Var2, Var3)  # put your new variables (e.g. means) here; Var2 and Var3 are placeholders
  data_OneID
}  # end of function
data_combined <- t(data.frame(sapply(read.all, mw_extraktion) ) )
