Read and rbind multiple csv files - r

I have a series of csv files (one per annum) with the same column headers and different numbers of rows. Originally I was reading them in and merging them like so:
setwd("N:/Ring data by cruise/Shetland")
LengthHeight2013 <- read.csv("N:/Ring data by cruise/Shetland/R_0113A_S2013_WD.csv",sep=",",header=TRUE)
LengthHeight2012 <- read.csv("N:/Ring data by cruise/Shetland/R_0212A_S2012_WD.csv",sep=",",header=TRUE)
LengthHeight2011 <- read.csv("N:/Ring data by cruise/Shetland/R_0211A_S2011_WOD.csv",sep=",",header=TRUE)
LengthHeight2010 <- read.csv("N:/Ring data by cruise/Shetland/R_0310A_S2010_WOD.csv",sep=",",header=TRUE)
LengthHeight2009 <- read.csv("N:/Ring data by cruise/Shetland/R_0309A_S2009_WOD.csv",sep=",",header=TRUE)
LengthHeight <- merge(LengthHeight2013,LengthHeight2012,all=TRUE)
LengthHeight <- merge(LengthHeight,LengthHeight2011,all=TRUE)
LengthHeight <- merge(LengthHeight,LengthHeight2010,all=TRUE)
LengthHeight <- merge(LengthHeight,LengthHeight2009,all=TRUE)
I would like to know if there is a shorter/tidier way to do this, bearing in mind that each time I run the script I might want to look at a different range of years.
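Ideally I could drive it from a vector of years, something like this rough sketch (the pattern is based on the year codes such as S2013 in my file names):
years <- 2009:2013
# build a regex matching any of the chosen years, e.g. "S(2009|2010|...)_"
pattern <- paste0("S(", paste(years, collapse = "|"), ")_")
files <- list.files("N:/Ring data by cruise/Shetland", pattern = pattern, full.names = TRUE)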
I also found this bit of code by Tony Cookson which looks like it would do what I want; however, the data frame it produces for me has the correct headers but no data rows.
multmerge = function(mypath){
  filenames = list.files(path = mypath, full.names = TRUE)
  datalist = lapply(filenames, function(x){read.csv(file = x, header = TRUE)})
  Reduce(function(x, y) {merge(x, y)}, datalist)
}
mymergeddata = multmerge("C://R//mergeme")

Find files (list.files) and read the files in a loop (lapply), then call (do.call) row bind (rbind) to put all files together by rows.
myMergedData <-
  do.call(rbind,
          lapply(list.files(path = "N:/Ring data by cruise",
                            full.names = TRUE), read.csv))
Update: there is the vroom package; according to its manual it is much faster than data.table::fread and base read.csv, and it accepts a vector of file paths directly. The syntax looks neat, too:
library(vroom)
files <- list.files(path = "N:/Ring data by cruise", full.names = TRUE)
myMergedData <- vroom(files)
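If you also need to know which file each row came from, vroom's id argument stores the source path in an extra column (a small sketch; the column name is arbitrary):
myMergedData <- vroom(files, id = "source_file")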

If you're looking for speed, then try this:
require(data.table) ## 1.9.2 or 1.9.3
filenames <- list.files("N:/Ring data by cruise", full.names = TRUE)
ans = rbindlist(lapply(filenames, fread))
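If provenance matters here too, rbindlist's idcol argument can label each row with its source, assuming the input list is named by file path (a sketch):
ans = rbindlist(lapply(setNames(filenames, filenames), fread), idcol = "file")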

You could try it in a tidy way like this:
library(readr)
library(dplyr)
files <- list.files('the path of your files',
                    pattern = "\\.csv$", recursive = TRUE, full.names = TRUE)
myMergedData <- read_csv(files) %>% bind_rows()

I don't have enough rep to comment, but to answer Rafael Santos: you can use the code here to add parameters to the lapply in the answer above: Using lapply and read.csv on multiple files (in R)
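For example, any arguments given after the function name in lapply are forwarded to read.csv on every call (a small sketch with assumed options):
datalist <- lapply(filenames, read.csv, header = TRUE, stringsAsFactors = FALSE)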

Related

Binding rows of multiple data frames into one data frame in R

I have a vector of file paths called dfs, and I want to create a data frame from each of those files and bind them together into one huge data frame, so I did something like this:
for (df in dfs){
clean_df <- bind_rows(as.data.table(read.delim(df, header=T, sep="|")))
return(clean_df)
}
but only the last file's data is being returned. How do I fix this?
I'm not sure about your file format, so I'll take a common .csv as an example. Replace the a * i part with actually reading the different files, instead of just generating mock-up data.
files = list()
for (i in 1:10) {
  a = read.csv('test.csv', header = FALSE) # stand-in for reading file i
  a = a * i                                # mock-up: scale so each frame differs
  files[[i]] = a
}
full_frame = data.frame(data.table::rbindlist(files))
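For the files in the question, the same pattern would look roughly like this (a sketch, assuming dfs holds the pipe-delimited file paths as described above):
files <- list()
for (i in seq_along(dfs)) {
  files[[i]] <- read.delim(dfs[i], header = TRUE, sep = "|")
}
full_frame <- data.frame(data.table::rbindlist(files))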
The problem is that you can only pass one file at a time to read.delim(). So the solution is to use a function like lapply() to read in each file listed in your dfs vector.
Here's an example, and you can find other answers to your question here.
library(tidyverse)
df <- c("file1.txt","file2.txt")
all.files <- lapply(df,function(i){read.delim(i, header=T, sep="|")})
clean_df <- bind_rows(all.files)
(clean_df)
Note that you don't need return(); wrapping clean_df in parentheses prompts R to print the variable.

Merging of multiple Excel files in R

I have a basic problem in R. I have to merge 72 Excel files of the same type, with the same variables, into a single data set in R. I have used the code below for merging, but this is not practical for so many files. Can anyone help me, please?
data1<-read.csv("D:/Customer_details1/data-01.csv")
data2<-read.csv("D:/Customer_details2/data-02.csv")
data3<-read.csv("D:/Customer_details3/data-03.csv")
data_merged<-merge(data1,data2,all.x = TRUE, all.y = TRUE)
data_merged2<-merge(data_merged,data3,all.x = TRUE, all.y = TRUE)
First, if the extensions are .csv, they're not Excel files, they're .csv files.
We can leverage the apply family of functions to do this efficiently.
First, let's create a list of the files:
setwd("D://Customer_details1/")
# create a list of all files in the working directory with the .csv extension
files <- list.files(pattern="*.csv")
Let's use purrr::map_dfr in this case, although we could also use lapply; map_dfr removes the need for a separate reduce step by automatically rbind-ing the results into one data frame:
library(purrr)
mainDF <- files %>% map_dfr(read.csv)
You can pass additional arguments to read.csv if you need to: map_dfr(read.csv, ...)
Note that for rbind to work the column names have to be the same, and I'm assuming they are based on your question.
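If the column names ever differ across files, one option is data.table::rbindlist() with fill = TRUE, which pads missing columns with NA instead of erroring (a sketch):
mainDF <- data.frame(data.table::rbindlist(lapply(files, read.csv), fill = TRUE))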
#Method I
library(readxl)
library(tidyverse)
path <- "C:/Users/Downloads"
setwd(path)
fn.import <- function(x) read_excel("country.xlsx", sheet = x)
sheet <- excel_sheets("country.xlsx")
data_frame <- lapply(setNames(sheet, sheet), fn.import)
data_frame <- bind_rows(data_frame, .id = "Sheet")
print(data_frame)
#Method II
install.packages("rio")
library(rio)
path <- "C:/Users/country.xlsx"
data <- import_list(path, rbind = TRUE)
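import_list() also accepts a vector of file paths, so the same one-liner covers the 72-file case (a sketch, assuming the files are collected in one folder such as D:/Customer_details/):
data <- import_list(dir("D:/Customer_details/", pattern = "\\.csv$", full.names = TRUE), rbind = TRUE)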

How to load and merge multiple .csv files in R?

So I'm very new at R and right now I'm trying to load multiple .csv files (~60 or so) and then merge them together. They all have similar columns and their files are named like this: dem_file_30, dem_file_31.
I've been trying to use scripts online but keep getting some errors. I'm sure I can do it by hand but that would be really tedious.
Example:
file_list <- list.files("/home/sjclark/demographics/")
list_of_files <- lapply(file_list, read.csv)
m1 <- merge_all(list_of_files, all=TRUE)
Error: merge_all doesn't exist
This one seems to read them into R, but then I'm not sure what to do after that... help?
setwd("/home/sjclark/demographics/")
filenames <- list.files(full.names=TRUE)
All <- lapply(filenames,function(i){
read.csv(i, header=TRUE)
})
It appears as if you might be trying to use the nice function shared on R-bloggers (credit to Tony Cookson):
multMerge = function(mypath){
filenames = list.files(path = mypath, full.names = TRUE)
datalist = lapply(filenames,
function(x){read.csv(file = x,
header = TRUE,
stringsAsFactors = FALSE)})
Reduce(function(x,y) {merge(x, y, all = TRUE)}, datalist)
}
Or perhaps you have pieced things together from different sources? In any event, merge is the crucial base R function that you were missing. merge_all isn't a base function (it came from the old reshape package), which is why you got that error.
Since you're new to R (and maybe all programming) it's worth noting that you'll need to define this function before you use it. Once you've done that you can call it like any other function:
my_data <- multMerge("/home/sjclark/demographics/")
I have just been doing a very similar task and was also wondering if there is a faster/better way to do it using dplyr and bind_rows.
My code for this task uses ldply from plyr:
library(plyr)
filenames <- list.files(path = "mypath", pattern = "\\.csv$", full.names = TRUE)
import.list <- ldply(filenames, read.csv)
Hope that helps
Rory

R, rbind with multiple files defined by a variable

First off, this is related to a homework question for the Coursera R programming course. I have found other ways to do what I want to do but my research has led me to a question I'm curious about. I have a variable number of csv files that I need to pull data from and then take the mean of the "pollutant" column in said files. The files are listed in their directory with an id number. I put together the following code which works fine for a single csv file but doesn't work for multiple csv files:
pollutantmean <- function(directory, pollutant, id = 1:332) {
  id <- formatC(id, width = 3, flag = "0")
  dataset <- read.csv(paste(directory, "/", id, ".csv", sep = ""), header = TRUE)
  mean(dataset[, pollutant], na.rm = TRUE)
}
I also know how to rbind multiple csv files together if I know the ids when I am creating the function, but I am not sure how to apply rbind to a variable range of ids, or if that's even possible. I found other ways to do it, such as calling lapply and then unlisting the data; I'm just curious whether there is an easier way.
Well, this uses an lapply, but it might be what you want.
file_list <- list.files("*your directory*", full.names = T)
combined_data <- do.call(rbind, lapply(file_list, read.csv, header = TRUE))
This will turn all of your files into one large dataset, and from there it's easy to take the mean. Is that what you wanted?
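Concretely (using the pollutant column name from your question):
mean(combined_data$pollutant, na.rm = TRUE)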
An alternative way of doing this would be to step through file by file, taking sums and number of observations and then taking the mean afterwards, like so:
sums <- numeric()
n <- numeric()
i <- 1
for (file in file_list) {
  temp_df <- read.csv(file, header = TRUE)
  vals <- temp_df$pollutant[!is.na(temp_df$pollutant)] # drop NAs, matching na.rm = TRUE
  sums[i] <- sum(vals)
  n[i] <- length(vals)
  i <- i + 1
}
new_mean <- sum(sums) / sum(n)
Note that both of these methods require that only your desired csvs are in that folder. You can use a pattern argument in the list.files call if you have other files in there that you're not interested in.
A vector is not accepted for 'file' in read.csv(file, ...).
Below is a slight modification of yours: a vector of file paths is created, and sapply loops over it.
files <- paste("directory-name/",formatC(1:332, width=3, flag="0"),
".csv",sep="")
pollutantmean <- function(file, pollutant) {
dataset <- read.csv(file, header = TRUE)
mean(dataset[, pollutant], na.rm = TRUE)
}
sapply(files, pollutantmean)
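One caveat: this returns one mean per file, and averaging those weights every file equally. For a single mean pooled over all rows, combine the files first, as in the previous answer; roughly (example column name again):
all_data <- do.call(rbind, lapply(files, read.csv))
mean(all_data[, "sulfate"], na.rm = TRUE)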

Loop in R to read many files

I have been wondering if anybody knows a way to create a loop that loads files/databases in R.
Say I have some files like this: data1.csv, data2.csv, ..., data100.csv.
In some programming languages one can write something like data + {x} + .csv and the system recognizes it as datax.csv, and then you can apply the loop.
Any ideas?
Sys.glob() is another possibility - its sole purpose is globbing, or wildcard expansion.
dataFiles <- lapply(Sys.glob("data*.csv"), read.csv)
That will read all the files of the form data[x].csv into list dataFiles, where [x] is nothing or anything.
[Note this is a different pattern to that in @Joshua's answer. There, list.files() takes a regular expression, whereas Sys.glob() just uses standard wildcards; which wildcards can be used is system dependent, and the details can be found on the help page ?Sys.glob.]
See ?list.files.
myFiles <- list.files(pattern="data.*csv")
Then you can loop over myFiles.
I would put all the CSV files in a directory, create a list and do a loop to read all the csv files from the directory in the list.
setwd("~/Documents/")
ldf <- list() # creates a list
listcsv <- dir(pattern = "\\.csv$") # creates the list of all the csv files in the directory
for (k in seq_along(listcsv)){
ldf[[k]] <- read.csv(listcsv[k])
}
str(ldf[[1]])
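Naming the list elements after their source files keeps later lookups readable (a small optional addition):
names(ldf) <- listcsv # so e.g. ldf[["data1.csv"]] works (file name hypothetical)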
Alternatively, list the csv files, read each one, and bind the rows in a single pipeline:
library(dplyr)
list_file <- list.files(pattern = "\\.csv$") %>%
  lapply(read.csv, stringsAsFactors = FALSE) %>%
  bind_rows()
fi <- list.files(directory_path, full.names = TRUE)
dat <- lapply(fi, read.csv)
dat will then contain the datasets in a list.
Let's assume that your files have the file format that you mentioned in your question and that they are located in the working directory.
You can vectorise creation of the file names if they have a simple naming structure. Then apply a loading function on all the files (here I used purrr package, but you can also use lapply)
library(purrr)
c(1:100) %>% paste0("data", ., ".csv") %>% map(read.csv)
Here's another solution using a for loop. I like it better than the others because of its flexibility and because all dfs are directly stored in the global environment.
Assume you've already set your working directory; the loop below iteratively reads each file and stores it in the global environment under the names "data1" through "data100".
ids <- 1:100
for (i in ids) {
  filename <- paste0("data", i)     # name of the object to create
  path <- paste0("data", i, ".csv") # file to read
  assign(filename, read.csv(path))
}
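If you later need those per-file data frames back in one object (e.g. to bind them), mget() can collect them from the global environment (a sketch):
dfs <- mget(paste0("data", ids))
all_data <- do.call(rbind, dfs)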
First, set the working directory.
Find and store all the files ending with .csv.
Bind all of them row-wise.
Following is the code sample:
setwd("C:/yourpath")
temp <- list.files(pattern = "\\.csv$")
allData <- do.call("rbind", lapply(temp, read.csv))
This may be helpful if you have datasets for participants as in psychology/sports/medicine etc.
setwd("C:/yourpath")
temp <- list.files(pattern = "*.sav")
#Maybe you want to unselect /delete IDs
DEL <- grep('ID(04|08|11|13|19).sav', temp)
temp2 <- temp[-DEL]
#Make a list of that contains all data
read.all <- lapply(temp2, read_sav)
#View(read.all[1])
#Option 1: put one under the next
df <- do.call("rbind", read.all)
Option 2: do something within each dataset (single IDs), e.g. get the mean of certain parts for each participant
mw_extraktion <- function(data_raw){
  data_raw <- data.frame(data_raw)
  #you may now calculate e.g. the mean of a certain variable for each ID
  ID <- data_raw$ID[1]
  data_OneID <- c(ID, Var2, Var3) #put your new variables (e.g. means) here; Var2/Var3 are placeholders
  data_OneID #return the combined row for this ID
} #end of function
data_combined <- t(data.frame(sapply(read.all, mw_extraktion)))
