Merging of multiple excel files in R - r

I am getting a basic problem in R. I have to merge 72 excel files with similar data type having same variables. I have to merge them to a single data set in R. I have used the below code for merging but this seems NOT practical for so many files. Can anyone help me please?
data1<-read.csv("D:/Customer_details1/data-01.csv")
data2<-read.csv("D:/Customer_details2/data-02.csv")
data3<-read.csv("D:/Customer_details3/data-03.csv")
data_merged<-merge(data1,data2,all.x = TRUE, all.y = TRUE)
data_merged2<-merge(data_merged,data3,all.x = TRUE, all.y = TRUE)

First, if the extensions are .csv, they're not Excel files, they're .csv files.
We can leverage the apply family of functions to do this efficiently.
First, let's create a list of the files:
setwd("D://Customer_details1/")
# create a list of all files in the working directory with the .csv extension
files <- list.files(pattern="*.csv")
Let's use purrr::map in this case, although we could also use lapply - updated to map_dfr to remove the need for reduce, by automatically rbind-ing into a data frame:
library(purrr)
mainDF <- files %>% map_dfr(read.csv)
You can pass additional arguments to read.csv if you need to: map(read.csv, ...)
Note that for rbind to work the column names have to be the same, and I'm assuming they are based on your question.

#Method I
library(readxl)
library(tidyverse)
path <- "C:/Users/Downloads"
setwd(path)
fn.import<- function(x) read_excel("country.xlsx", sheet=x)
sheet = excel_sheets("country.xlsx")
data_frame = lapply(setNames(sheet, sheet), fn.import )
data_frame = bind_rows(data_frame, .id="Sheet")
print (data_frame)
#Method II
install.packages("rio")
library(rio)
path <- "C:/Users/country.xlsx"
data <- import_list(path , rbind=TRUE)

Related

Best way to import multiple .csv files into separate data frames? lapply() [duplicate]

Suppose we have files file1.csv, file2.csv, ... , and file100.csv in directory C:\R\Data and we want to read them all into separate data frames (e.g. file1, file2, ... , and file100).
The reason for this is that, despite having similar names they have different file structures, so it is not that useful to have them in a list.
I could use lapply but that returns a single list containing 100 data frames. Instead I want these data frames in the Global Environment.
How do I read multiple files directly into the global environment? Or, alternatively, How do I unpack the contents of a list of data frames into it?
Thank you all for replying.
For completeness here is my final answer for loading any number of (tab) delimited files, in this case with 6 columns of data each where column 1 is characters, 2 is factor, and remainder numeric:
##Read files named xyz1111.csv, xyz2222.csv, etc.
filenames <- list.files(path="../Data/original_data",
pattern="xyz+.*csv")
##Create list of data frame names without the ".csv" part
names <-substr(filenames,1,7)
###Load all files
for(i in names){
filepath <- file.path("../Data/original_data/",paste(i,".csv",sep=""))
assign(i, read.delim(filepath,
colClasses=c("character","factor",rep("numeric",4)),
sep = "\t"))
}
Quick draft, untested:
Use list.files() aka dir() to dynamically generate your list of files.
This returns a vector, just run along the vector in a for loop.
Read the i-th file, then use assign() to place the content into a new variable file_i
That should do the trick for you.
Use assign with a character variable containing the desired name of your data frame.
for(i in 1:100)
{
oname = paste("file", i, sep="")
assign(oname, read.csv(paste(oname, ".txt", sep="")))
}
This answer is intended as a more useful complement to Hadley's answer.
While the OP specifically wanted each file read into their R workspace as a separate object, many other people naively landing on this question may think that that's what they want to do, when in fact they'd be better off reading the files into a single list of data frames.
So for the record, here's how you might do that.
#If the path is different than your working directory
# you'll need to set full.names = TRUE to get the full
# paths.
my_files <- list.files("path/to/files")
#Further arguments to read.csv can be passed in ...
all_csv <- lapply(my_files,read.csv,...)
#Set the name of each list element to its
# respective file name. Note full.names = FALSE to
# get only the file names, not the full path.
names(all_csv) <- gsub(".csv","",
list.files("path/to/files",full.names = FALSE),
fixed = TRUE)
Now any of the files can be referred to by my_files[["filename"]], which really isn't much worse that just having separate filename variables in your workspace, and often it is much more convenient.
Here is a way to unpack a list of data.frames using just lapply
filenames <- list.files(path="../Data/original_data",
pattern="xyz+.*csv")
filelist <- lappy(filenames, read.csv)
#if necessary, assign names to data.frames
names(filelist) <- c("one","two","three")
#note the invisible function keeps lapply from spitting out the data.frames to the console
invisible(lapply(names(filelist), function(x) assign(x,filelist[[x]],envir=.GlobalEnv)))
Reading all the CSV files from a folder and creating vactors same as the file names:
setwd("your path to folder where CSVs are")
filenames <- gsub("\\.csv$","", list.files(pattern="\\.csv$"))
for(i in filenames){
assign(i, read.csv(paste(i, ".csv", sep="")))
}
A simple way to access the elements of a list from the global environment is to attach the list. Note that this actually creates a new environment on the search path and copies the elements of your list into it, so you may want to remove the original list after attaching to prevent having two potentially different copies floating around.
I want to update the answer given by Joran:
#If the path is different than your working directory
# you'll need to set full.names = TRUE to get the full
# paths.
my_files <- list.files(path="set your directory here", full.names=TRUE)
#full.names=TRUE is important to be added here
#Further arguments to read.csv can be passed in ...
all_csv <- lapply(my_files, read.csv)
#Set the name of each list element to its
# respective file name. Note full.names = FALSE to
# get only the file names, not the full path.
names(all_csv) <- gsub(".csv","",list.files("copy and paste your directory here",full.names = FALSE),fixed = TRUE)
#Now you can create a dataset based on each filename
df <- as.data.frame(all_csv$nameofyourfilename)
a simplified version, assuming your csv files are in the working directory:
listcsv <- list.files(pattern= "*.csv") #creates list from csv files
names <- substr(listcsv,1,nchar(listcsv)-4) #creates list of file names, no .csv
for (k in 1:length(listcsv)){
assign(names[[k]] , read.csv(listcsv[k]))
}
#cycles through the names and assigns each relevant dataframe using read.csv
#copy all the files you want to read in R in your working directory
a <- dir()
#using lapply to remove the".csv" from the filename
for(i in a){
list1 <- lapply(a, function(x) gsub(".csv","",x))
}
#Final step
for(i in list1){
filepath <- file.path("../Data/original_data/..",paste(i,".csv",sep=""))
assign(i, read.csv(filepath))
}
Use list.files and map_dfr to read many csv files
df <- list.files(data_folder, full.names = TRUE) %>%
map_dfr(read_csv)
Reproducible example
First write sample csv files to a temporary directory.
It's more complicated than I thought it would be.
library(dplyr)
library(purrr)
library(purrrlyr)
library(readr)
data_folder <- file.path(tempdir(), "iris")
dir.create(data_folder)
iris %>%
# Keep the Species column in the output
# Create a new column that will be used as the grouping variable
mutate(species_group = Species) %>%
group_by(species_group) %>%
nest() %>%
by_row(~write.csv(.$data,
file = file.path(data_folder, paste0(.$species_group, ".csv")),
row.names = FALSE))
Read these csv files into one data frame.
Note the Species column has to be present in the csv files, otherwise we would loose that information.
iris_csv <- list.files(data_folder, full.names = TRUE) %>%
map_dfr(read_csv)

How to read in a list of files to a single data.table

I am using R. I have a series of txt files that all all ; seperated. I want to combined those files into 1 data frame. Each individual file has the variable names in the top row.
I found a useful post for doing this online. How to import multiple .csv files at once?. I am using rbindlist to do this, though I am open to suggestions if there is an easier way to import this data.
Here is my code.
# create list of files
temp <- NULL
temp <- as.list (list.files ("F:/Desktop/data"))
temp
# now rbind the list using rbindlist
data <- NULL
data <- rbindlist(temp, fill=FALSE, idcol=NULL)
is.list (temp)
typeof (temp)
When I run this code, I get an error
Error in rbindlist(temp, fill = FALSE, idcol = NULL) :
Item 1 of input is not a data.frame, data.table or list
However, when I check temp using is.list and typeof, it shows up as a list. I am not sure why rbindlist isn't reading the list as a list. Is rbindlist the best way to import this data.
The list.files is just giving the file names. We need to read those files. As we are using rbindlist from data.table, read with fread and then use the rbindlist
library(data.table)
temp <- list.files ("F:/Desktop/data", full.names = TRUE, pattern = "\\.txt$")
out <- rbindlist(lapply(temp, fread), fill = TRUE)

Read data from data files into R data frames

I would like to read data from several files into separated data frames. The files are in a different folder than the script.
I have used list with filenames.
users_list <- list.files(path = "Data_Eye/Jazz/data/first",
pattern = "*.cal", full.names = F)
I tried to use functions map and read_delim but without success. It is important for me to read each file to a different dataframe. It will be the best to have the list of data frames.
You can do something like this, though I don't have any .cal files to test it on. So, it's possible you might need a different function for reading in those files.
library(devtools)
devtools::install_github("https://github.com/KirtOnthank/OTools")
library(OTools)
# Give full path to your files (if different than working directory).
temp = list.files(path="../Data/original_data", pattern="*.cal", full.names = TRUE)
# Then, apply the read.cal function to the list of files.
myfiles = lapply(temp, OTools::read.cal)
# Then, set the name of each list element (each dataframe) to its respective file name.
names(myfiles) <- gsub(".cal","",
list.files("../Data/original_data",full.names = FALSE),
fixed = TRUE)
# Now, put all of those individual dataframes from your list into the global environment as separate dataframes.
list2env(myfiles,envir=.GlobalEnv)
In base R, just use lapply to generate a list of data frames:
list_of_dfs <- lapply(users_list, read_delim)

how can I import and convert multiple excel files in a folder into dataframe in R

I have 1,000 files in a folder and want to work on them as a data frame in R. I can convert individual files but need a way to convert files all at once. Any help is needed, please.
what I have so far:
my code:
my_files <- list.files()
my_files <- as.data.frame(my_files)
my_files
The class(my_files) shows a data frame but not really working as a data frame
Here is the solution I use in this situation. Make changes where needed. You need to install the packages rio and data.table first.
my_files <- list.files(path = "~", # change this to the path where your files are
pattern = ".xlsx$", # use the file ending of your files
full.names = TRUE,
recursive = TRUE)
my_df_list <- lapply(my_files, rio::import) # imports files and produces list of data.frames
my_df <- data.table::rbindlist(my_df_list) # attempts to bind data.frames into one
We can use list.files to get path of the files and using lapply we can iterate over file names, read files and remove the first column.
my_files <- list.files(full.names = TRUE)
all_files <- lapply(my_files, function(x) read.csv(x)[-1])
all_files is a list of dataframes without the first column in each file which can be accessed via all_files[[1]], all_files[[2]] etc.

How to read and rbind all .xlsx files in a folder efficiently using read_excel

I am new to R and need to create one dataframe from 80 .xlsx files that mostly share the same columns and are all in the same folder. I want to bind all these files efficiently in a manner that would work if I added or removed files from the folder later. I want to do this without converting the files to .csv, unless someone can show me how to that efficiently for large numbers of files within R itself.
I've previously been reading files individually using the read_excel function from the readxl package. After, I would use rbind to bind them. This was fine for 10 files, but not 80! I've experimented with many solutions offered online however none of these seem to work, largely because they are using functions other than read_excel or formats other than .xlsx. I haven't kept track of many of my failed attempts, so cannot offer code other than one alternate method I tried to adapt to read_excel from the read_csv function.
#Method 1
library(readxl)
library(purr)
library(dplyr)
library(tidyverse)
file.list <- list.files(pattern='*.xlsx')
alldata <- file.list %>%
map(read_excel) %>%
reduce(rbind)
#Output
New names:
* `` -> ...2
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
Any code on how to do this would be greatly appreciated. Sorry if anything is wrong about this post, it is my first one.
UPDATE:
Using the changes suggested by the answers, I'm now using the code:
file.list <- list.files(pattern='*.xlsx')
alldata <- file.list %>%
map_dfr(read_excel) %>%
reduce(bind_rows)
This output now is as follows:
New names:
* `` -> ...2
Error: Column `10.Alert.alone` can't be converted from numeric to character
This happens regardless of which type of bind() function I use in the reduce() slot. If anyone can help with this, please let me know!
You're on the right track here. But you need to use map_dfr instead of plain-vanilla map. map_dfr outputs a data frame (or actually tibble) for each iteration, and combines them via bind_rows.
This should work:
library(readxl)
library(tidyverse)
file.list <- list.files(pattern='*.xlsx')
alldata <- file.list %>%
map_dfr(~read_excel(.x))
Note that this assumes your files all have consistent column names and data types. If they don't, you may have to do some cleaning. (One trick I've used in complex cases is to add a %>% mutate_all(as.character) to the read_excel command inside the map function. That will turn everything into characters, and then you can convert the data types from there.)
this should get you there/close...
library(data.table)
library(readxl)
#create files list
file.list <- list.files( pattern = ".*\\.xlsx$", full.names = TRUE )
#read files to list of data.frames
l <- lapply( l, readxl::read_excel )
#bind l together to one larger data.table, by columnname, fill missing with NA
dt <- data.table::rbindlist( l, use.names = TRUE, fill = TRUE )
Try using map_dfr.
alldata <- file.list %>%
map_dfr(read_excel)

Resources