How to load and merge multiple .csv files in R?

So I'm very new at R and right now I'm trying to load multiple .csv files (~60 or so) and then merge them together. They all have similar columns and their files are named like this: dem_file_30, dem_file_31.
I've been trying to use scripts online but keep getting some errors. I'm sure I can do it by hand but that would be really tedious.
Example:
file_list <- list.files("/home/sjclark/demographics/")
list_of_files <- lapply(file_list, read.csv)
m1 <- merge_all(list_of_files, all=TRUE)
Error in merge_all(list_of_files, all = TRUE) : could not find function "merge_all"
This one seems to read them into R, but then I'm not sure what to do after that... help?
setwd("/home/sjclark/demographics/")
filenames <- list.files(full.names=TRUE)
All <- lapply(filenames, function(i) {
  read.csv(i, header = TRUE)
})

It appears as if you might be trying to use the nice function shared on R-bloggers (credit to Tony Cookson):
multMerge = function(mypath) {
  filenames = list.files(path = mypath, full.names = TRUE)
  datalist = lapply(filenames,
                    function(x) {read.csv(file = x,
                                          header = TRUE,
                                          stringsAsFactors = FALSE)})
  Reduce(function(x, y) {merge(x, y, all = TRUE)}, datalist)
}
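For intuition, Reduce() folds merge() pairwise over the list from left to right, so for four hypothetical data frames df1..df4 the last line is equivalent to:
merge(merge(merge(df1, df2, all = TRUE), df3, all = TRUE), df4, all = TRUE)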
Or perhaps you have pieced things together from different sources? In any event, merge is the crucial base R function that you were missing. merge_all doesn't exist in base R, which is why you got that error.
Since you're new to R (and maybe all programming) it's worth noting that you'll need to define this function before you use it. Once you've done that you can call it like any other function:
my_data <- multMerge("/home/sjclark/demographics/")
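One caveat before leaning on multMerge: merge() joins data frames on their shared columns, while rbind() stacks rows. If your ~60 files are just slices of the same table, stacking is usually what you actually want; merge(all = TRUE) behaves like a union and can quietly collapse or multiply duplicate rows. A toy illustration with made-up frames a and b:
a <- data.frame(id = 1:2, x = c("a", "b"))
b <- data.frame(id = 2:3, x = c("b", "c"))
merge(a, b, all = TRUE)  # full join on the shared columns: 3 unique rows
rbind(a, b)              # stack: 4 rows, the id = 2 row appears twice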

I have just been doing a very similar task and was also wondering if there is a faster/better way to do it using dplyr and bind_rows.
My code for this task uses ldply from plyr:
library(plyr)
filenames <- list.files(path = "mypath", pattern = "\\.csv$", full.names = TRUE)
import.list <- ldply(filenames, read.csv)
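As for the dplyr/bind_rows route I was wondering about, a minimal sketch (reusing the same filenames vector) would be:
library(dplyr)
import.list <- bind_rows(lapply(filenames, read.csv))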
Hope that helps
Rory

Related

Loop over all subdirectories and read in a file in each subdirectory

I have an output directory from dbCAN with each sample's output in a subdirectory. I need to loop over each subdirectory and read into R a file called overview.csv.
for (subdir in list.dirs(recursive = FALSE)) {
  data = read.csv(file.path(~\\subdir, "overview.csv"))
}
I am unsure how to deal with the changing filepath in read.csv for each subdir. Any help would be appreciated.
Up front, the ~\\subdir (not as a string) is obviously problematic. Since subdir is already a string, using file.path is correct but with just the variable. If you are concerned about relative versus absolute paths, you can always normalize them with normalizePath(list.dirs()), though this does not really change things here.
A few things to consider.
Constantly reassigning to the same variable doesn't help, so either you need to assign to an element of a list or something else (e.g., lapply, below). (I also think data as a variable name is problematic. While it works just fine "now", if you ever run part of your script without assigning to data, you will be referencing the function, resulting in possibly confusing errors such as Error in data$a : object of type 'closure' is not subsettable; since a closure is really just a function with its enclosing namespace/environment, this is just saying "you tried to do something to a function".)
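For example, a minimal reproduction of that trap (in a fresh session where nothing has been assigned to data, so the name resolves to the base function utils::data):
data$a
# Error in data$a : object of type 'closure' is not subsettable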
I think both pattern= and full.names= might be useful to switch from using list.dirs to list.files, such as
datalist <- list()
# I hope recursion doesn't go too deep here
filelist <- list.files(pattern = "overview.csv", full.names = TRUE, recursive = TRUE)
for (ind in seq_along(filelist)) {
  datalist[[ind]] <- read.csv(filelist[ind])
}
# perhaps combine into one frame
data1 <- do.call(rbind, datalist)
Reading in lots of files and doing the same thing to all of them suggests lapply. This is a little more compact version of the loop above:
filelist <- list.files(pattern = "overview.csv", recursive = TRUE, full.names = TRUE)
datalist <- lapply(filelist, read.csv)
data1 <- do.call(rbind, datalist)
Note: if you really only need precisely one level of subdirs, you can work around that with:
filelist <- list.files(list.dirs(somepath, recursive = FALSE),
                       pattern = "overview.csv", full.names = TRUE)
or you can limit to no more than some depth, perhaps with list.dirs.depth.n from https://stackoverflow.com/a/48300309.
I think it should be this (note that paste0 alone would drop the path separator, so use file.path):
for (subdir in list.dirs(recursive = FALSE)) {
  data = read.csv(file.path(subdir, "overview.csv"))
}
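If you also want to sidestep the reassignment issue raised in the other answer, a sketch that keeps every file in a list instead of overwriting data each pass:
datalist <- lapply(list.dirs(recursive = FALSE),
                   function(d) read.csv(file.path(d, "overview.csv")))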

Create dataframe from list in Rproj

I have an issue that really bugs me: I've been trying to convert to Rproj lately, because I would like to make my data and scripts available at some point. But with one of them, I get an error that, I think, should not occur. Here is the tiny bit of code that gives me so much trouble, the R.proj being available at: https://github.com/fredlm/mockup.
library(readxl)
list <- list.files(path = "data", pattern = "file.*.xls") #List excel files
#Aggregate all excel files
df <- lapply(list, read_excel)
for (i in 1:length(df)) {
  df[[i]] <- cbind(df[[i]], list[i])
}
df <- do.call("rbind", df)
It gives me the following error right after "df <- lapply(list, read_excel)":
Error in read_fun(path = path, sheet = sheet, limits = limits, shim = shim, : path[1]="file_1.xls": No such file or directory
Do you know why? When I do it old school, i.e. using 'setwd' before creating 'list', everything works just fine. So it looks like lapply does not know where to look for the files when used in an Rproj, which seems very odd...
What did I miss?
Thanks :)
Thanks to a non-stackoverflower, a solution was found. It's silly, but 'list' was missing a directory, so lapply couldn't aggregate the data. The following works just fine:
list <- paste("data/", list.files(path = "data", pattern = "file.*.xls"), sep = "") #List excel files
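Equivalently, list.files() can build the full paths for you via full.names = TRUE, which avoids the paste step entirely:
list <- list.files(path = "data", pattern = "file.*.xls", full.names = TRUE)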

Merging of multiple excel files in R

I am running into a basic problem in R. I have to merge 72 Excel files of similar data types with the same variables into a single data set in R. I have used the code below for merging, but this seems NOT practical for so many files. Can anyone help me please?
data1<-read.csv("D:/Customer_details1/data-01.csv")
data2<-read.csv("D:/Customer_details2/data-02.csv")
data3<-read.csv("D:/Customer_details3/data-03.csv")
data_merged<-merge(data1,data2,all.x = TRUE, all.y = TRUE)
data_merged2<-merge(data_merged,data3,all.x = TRUE, all.y = TRUE)
First, if the extensions are .csv, they're not Excel files, they're .csv files.
We can leverage the apply family of functions to do this efficiently.
First, let's create a list of the files:
setwd("D://Customer_details1/")
# create a list of all files in the working directory with the .csv extension
files <- list.files(pattern="*.csv")
Let's use purrr in this case, although we could also use lapply. Using map_dfr removes the need for a separate reduce step, by automatically rbind-ing the results into one data frame:
library(purrr)
mainDF <- files %>% map_dfr(read.csv)
You can pass additional arguments to read.csv if you need to: map_dfr(files, read.csv, ...)
Note that for rbind to work the column names have to be the same, and I'm assuming they are based on your question.
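If the columns ever differ, dplyr::bind_rows() (which map_dfr uses under the hood) is more forgiving than base rbind(); a toy example:
library(dplyr)
a <- data.frame(x = 1)
b <- data.frame(x = 2, y = 3)
bind_rows(a, b)  # fills the missing y with NA
# rbind(a, b)    # errors: numbers of columns of arguments do not match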
# Method I: read every sheet of one workbook and bind them by rows
library(readxl)
library(tidyverse)
path <- "C:/Users/Downloads"
setwd(path)
fn.import <- function(x) read_excel("country.xlsx", sheet = x)
sheet <- excel_sheets("country.xlsx")
data_frame <- lapply(setNames(sheet, sheet), fn.import)
data_frame <- bind_rows(data_frame, .id = "Sheet")
print(data_frame)
# Method II: rio reads everything and row-binds in one call
install.packages("rio")
library(rio)
path <- "C:/Users/country.xlsx"
data <- import_list(path, rbind = TRUE)
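Note that import_list() also accepts a vector of file paths, so for the 72 csv files spread across folders something like this sketch should work (the root directory here is an assumption):
library(rio)
files <- list.files("D:/", pattern = "\\.csv$", recursive = TRUE, full.names = TRUE)
data <- import_list(files, rbind = TRUE)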

R, rbind with multiple files defined by a variable

First off, this is related to a homework question for the Coursera R programming course. I have found other ways to do what I want to do but my research has led me to a question I'm curious about. I have a variable number of csv files that I need to pull data from and then take the mean of the "pollutant" column in said files. The files are listed in their directory with an id number. I put together the following code which works fine for a single csv file but doesn't work for multiple csv files:
pollutantmean <- function(directory, pollutant, id = 1:332) {
  id <- formatC(id, width = 3, flag = "0")
  dataset <- read.csv(paste(directory, "/", id, ".csv", sep = ""), header = TRUE)
  mean(dataset[, pollutant], na.rm = TRUE)
}
I also know how to rbind multiple csv files together if I know the ids when I am creating the function, but I am not sure how to assign rbind to a variable range of ids, or if that's even possible. I found other ways to do it, such as calling lapply and then unlisting the data; I'm just curious if there is an easier way.
Well, this uses an lapply, but it might be what you want.
file_list <- list.files("*your directory*", full.names = T)
combined_data <- do.call(rbind, lapply(file_list, read.csv, header = TRUE))
This will turn all of your files into one large dataset, and from there it's easy to take the mean. Is that what you wanted?
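For completeness, the mean step afterwards would just be (assuming the column really is named pollutant):
mean(combined_data$pollutant, na.rm = TRUE)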
An alternative way of doing this would be to step through file by file, taking sums and number of observations and then taking the mean afterwards, like so:
sums <- numeric()
n <- numeric()
i <- 1
for (file in file_list) {
  temp_df <- read.csv(file, header = TRUE)
  sums[i] <- sum(temp_df$pollutant)
  n[i] <- nrow(temp_df)
  i <- i + 1
}
new_mean <- sum(sums) / sum(n)
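One caution on this second method: your original code used na.rm = TRUE, so the files presumably contain NAs, and the running sums should skip them too, e.g.:
sums[i] <- sum(temp_df$pollutant, na.rm = TRUE)
n[i] <- sum(!is.na(temp_df$pollutant))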
Note that both of these methods require that only your desired csvs are in that folder. You can use a pattern argument in the list.files call if you have other files in there that you're not interested in.
A vector is not accepted for 'file' in read.csv(file, ...)
Below is a slight modification of yours. A vector of file paths is created, and then they are looped over with sapply.
files <- paste("directory-name/", formatC(1:332, width = 3, flag = "0"),
               ".csv", sep = "")
pollutantmean <- function(file, pollutant) {
  dataset <- read.csv(file, header = TRUE)
  mean(dataset[, pollutant], na.rm = TRUE)
}
# extra arguments after the function are passed through;
# "pollutant" below is a placeholder column name
sapply(files, pollutantmean, pollutant = "pollutant")

Read and rbind multiple csv files

I have a series of csv files (one per annum) with the same column headers and different numbers of rows. Originally I was reading them in and merging them like so:
setwd("N:/Ring data by cruise/Shetland")
LengthHeight2013 <- read.csv("N:/Ring data by cruise/Shetland/R_0113A_S2013_WD.csv",sep=",",header=TRUE)
LengthHeight2012 <- read.csv("N:/Ring data by cruise/Shetland/R_0212A_S2012_WD.csv",sep=",",header=TRUE)
LengthHeight2011 <- read.csv("N:/Ring data by cruise/Shetland/R_0211A_S2011_WOD.csv",sep=",",header=TRUE)
LengthHeight2010 <- read.csv("N:/Ring data by cruise/Shetland/R_0310A_S2010_WOD.csv",sep=",",header=TRUE)
LengthHeight2009 <- read.csv("N:/Ring data by cruise/Shetland/R_0309A_S2009_WOD.csv",sep=",",header=TRUE)
LengthHeight <- merge(LengthHeight2013,LengthHeight2012,all=TRUE)
LengthHeight <- merge(LengthHeight,LengthHeight2011,all=TRUE)
LengthHeight <- merge(LengthHeight,LengthHeight2010,all=TRUE)
LengthHeight <- merge(LengthHeight,LengthHeight2009,all=TRUE)
I would like to know if there is a shorter/tidier way to do this, also considering that each time I run the script I might want to look at a different range of years.
I also found this bit of code by Tony Cookson which looks like it should do what I want; however, the data frame it produces for me has only the correct headers but no data rows.
multmerge = function(mypath) {
  filenames = list.files(path = mypath, full.names = TRUE)
  datalist = lapply(filenames, function(x) {read.csv(file = x, header = TRUE)})
  Reduce(function(x, y) {merge(x, y)}, datalist)
}
mymergeddata = multmerge("C://R//mergeme")
Find files (list.files) and read the files in a loop (lapply), then call (do.call) row bind (rbind) to put all files together by rows.
myMergedData <-
  do.call(rbind,
          lapply(list.files(path = "N:/Ring data by cruise", full.names = TRUE), read.csv))
Update: there is now the vroom package; according to its documentation it is much faster than data.table::fread and base read.csv, and it accepts a vector of file paths directly. The syntax looks neat, too:
library(vroom)
files <- list.files(path = "N:/Ring data by cruise", full.names = TRUE)
myMergedData <- vroom(files)
If you're looking for speed, then try this:
require(data.table) ## 1.9.2 or 1.9.3
filenames <- list.files(path = "N:/Ring data by cruise", full.names = TRUE)  # as above
ans = rbindlist(lapply(filenames, fread))
You could try it in a tidy way like this:
library(readr)
library(dplyr)
files <- list.files('the path of your files',
                    pattern = "\\.csv$", recursive = TRUE, full.names = TRUE)
myMergedData <- read_csv(files) %>% bind_rows()
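Since the question also mentions wanting a different range of years on each run, one option is to build the pattern from the years themselves; this sketch assumes the year always appears in the file name as in S2013:
years <- 2010:2013  # whatever range this run needs
files <- list.files("N:/Ring data by cruise/Shetland", full.names = TRUE,
                    pattern = paste0("S(", paste(years, collapse = "|"), ")"))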
I don't have enough rep to comment, but to answer Rafael Santos: you can use the code here to add params to the lapply in the answer above: Using lapply and read.csv on multiple files (in R)
