Merging thousands of csv files into a single dataframe in R

I have 2500 csv files, all with the same columns and a variable number of observations.
Each file is approximately 3 MB (~10000 obs per file).
Ideally, I would like to read all of these into a single dataframe.
Each file represents a generation and contains information on traits, phenotypes and allele frequencies.
While reading in this data, I am also trying to add an extra column to each file indicating its generation.
I have written the following code:
read_data <- function(ex_files, ex) {
  df <- NULL
  ex <- as.character(ex)
  for (n in 1:length(ex_files)) {
    temp <- read.csv(paste("Experiment ", ex, "/all e", ex,
                           " gen", as.character(n), ".csv", sep = ""))
    temp$generation <- n
    df <- rbind(df, temp)
  }
  return(df)
}
ex_files is the list of files (the loop runs over its length), while ex refers to the experiment number, as the experiment was performed in replicate (i.e. I have multiple experiments, each with 2500 csv files).
I am currently running it (I hope it's written correctly!), but it is taking quite a while (as expected). I'm wondering if there is a quicker way of doing this?

It is inefficient to grow objects in a loop. List all the files you want to read using list.files, then combine them into one dataframe with purrr::map_df; its .id argument adds a column (here called generation) giving a unique number to each file. Note that with an unnamed vector of paths, .id records each file's position in the vector, so make sure the files sort in generation order (list.files sorts lexicographically, which puts e.g. gen10 before gen2).
filenames <- list.files(pattern = '\\.csv', full.names = TRUE)
df <- purrr::map_df(filenames, read.csv, .id = 'generation')
head(df)
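If it is still slow, here is a sketch of a typically faster variant using data.table (assuming the same filenames vector as above): fread parses csv files much faster than read.csv, and rbindlist binds the whole list in one step, with idcol playing the role of .id.
library(data.table)
# fread each file, then bind all at once; idcol numbers rows by source file, like .id above
df <- rbindlist(lapply(filenames, fread), idcol = 'generation')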

Try the plyr package:
filenames <- list.files(pattern = '\\.csv', full.names = TRUE)
df <- plyr::ldply(filenames, read.csv)
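If you also want the per-file generation column with plyr, ldply has an .id argument that is used when its input is named; a small sketch (here the names, and hence the id values, are the filenames themselves):
df <- plyr::ldply(setNames(filenames, filenames), read.csv, .id = 'generation')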

Related

Merge nested dataframes in R

I have several DFs. Each of them is the result csv file of one participant from my experiment. Some of the csvs have 48 variables; others have, in addition to these identical variables, 6 more (53 variables). However, if I try to merge them like this:
flist <- list.files(path = "my path", pattern = ".csv", full.names = TRUE)
Merge <- plyr::ldply(flist, read_csv)  # Merge all files
the merging is done by column order and not by variable name, so in one column of my big combined DF I get data from different variables.
So I tried a different strategy: loading my files as separate DFs:
data_files <- list.files("my_path")  # Identify file names
data_files

for (i in 1:length(data_files)) {  # Head of for-loop
  assign(paste0("data", i),        # Read and store data frames
         read_csv(paste0("my_path/", data_files[i])))
}
Then I tried to merge them with this script:
listDF <- names(which(unlist(eapply(.GlobalEnv, is.data.frame))))  # list of my DFs
listDF
library(plyr)
MergeDF <- do.call('rbind.fill', listDF)  # fails: listDF holds the names as strings, not the data frames
But I'm still stuck.
We may use map_dfr, which matches columns by name (filling missing ones with NA) and records each file in an id column:
library(readr)
library(purrr)
map_dfr(setNames(flist, flist), read_csv, .id = "id")
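If you prefer to stay with the rbind.fill route from the question, the missing step is converting the character vector of names into the actual data frames, which mget does; a minimal sketch:
library(plyr)
# mget turns the vector of object names into a named list of the data frames themselves
MergeDF <- do.call(rbind.fill, mget(listDF))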

Conducting summary statistics on multiple dataframes in R

Apologies if this has been answered elsewhere. I am looking to calculate and output summary statistics across multiple dataframes in R.
For context, my data is stored in .txt files for each subject - just one column: 63 obs of 1 variable. In total I have 48 files corresponding to 48 subjects.
I read these files into RStudio and created multiple per-subject dataframes using the following script:
filenames <- gsub("\\.txt$", "", list.files(pattern = "\\.txt$"))
for (i in filenames) {
  assign(i, read.delim(paste(i, ".txt", sep = "")))
}
The dataframes are named e.g. 001_fd, 002_fd, ...
So what I hope to do is create a for loop that calculates summary stats for each dataframe and then output the results for each into a single csv file.
Any assistance here will be greatly appreciated
It is best to avoid object names that start with numbers. You also haven't mentioned what you mean by summary statistics or what exactly you want to calculate; I'll calculate mean and median here, and you can include more if needed.
First, get all the dataframes in a list using mget:
list_df <- mget(ls(pattern = '\\d+_fd'))
Using lapply, you can calculate whatever you want. Let's say each dataframe has a single column named x; then you can do:
output_df <- do.call(rbind, lapply(list_df, function(df)
  data.frame(mean = mean(df$x), med = median(df$x))))
Or use purrr::map_df, which makes this shorter:
output_df <- purrr::map_df(list_df,
                           ~ data.frame(mean = mean(.x$x), med = median(.x$x)))
Write the results to csv.
write.csv(output_df, 'results.csv', row.names = FALSE)
You don't have to use assign to create a variable for each txt file.
Just list all the txt files with list.files and append each file's summary to an initially empty dataframe in a loop.
This is the simplest method, but may not be the most efficient way.
filenames <- list.files(pattern = "\\.txt$")
output <- data.frame()
for (f in filenames) {
  content <- read.delim(f, header = FALSE)
  sum <- summary(content[, 1])  # the six summary statistics of the single column
  output <- rbind(output, sum)
}
colnames(output) <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")
write.csv(output, "output.csv", row.names = FALSE)
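For what it's worth, a sketch of the same computation without an explicit loop (assuming the same single-column txt files): sapply collects one summary per file, and t() flips the result so each file is a row.
filenames <- list.files(pattern = "\\.txt$")
# one row of summary statistics per file; the row names carry the filenames
output <- t(sapply(filenames, function(f) summary(read.delim(f, header = FALSE)[, 1])))
write.csv(output, "output.csv")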

How to take only common columns across multiple csv's while appending data

I am currently using the below function to read in and combine several (7) csvs in R.
library(data.table)

csv_append <- function(file_path = filePath) {
  files <- grep(list.files(path = file_path, full.names = TRUE),
                pattern = "final_data_dummied_", value = TRUE)
  # Load all files into a list of dataframes
  df_list <- lapply(files, fread, nThread = 4)
  DT <- rbindlist(df_list, fill = TRUE)
  # Convert data.table to dataframe
  df_seg <- setDF(DT)
  rm(list = c("DT", "df_list"))
  # Replace missing values with 0
  df_seg[is.na(df_seg)] <- 0
  return(df_seg)
}
However, the original files are large (0.5 million rows and ~3500 columns). The number of columns varies from 3400 to 3700, and when I combine these files R gives a memory error: cannot allocate vector of size 85Gb.
I am thinking that if I take the intersection of columns from all the csvs and read in only those columns from each csv, it might solve the problem.
But I am not sure how I can do that while reading in the files.
Can someone please help me with this?
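One way to do this with data.table (which the function above already uses), sketched under the question's own file naming: read only the headers first, intersect them, then pass the shared columns to fread's select argument so the extra columns are never loaded into memory.
library(data.table)
files <- grep(list.files(path = filePath, full.names = TRUE),
              pattern = "final_data_dummied_", value = TRUE)
# Pass 1: read zero data rows to get each file's column names cheaply
common_cols <- Reduce(intersect, lapply(files, function(f) names(fread(f, nrows = 0))))
# Pass 2: read only the shared columns, then stack; no fill needed since columns now match
df_seg <- setDF(rbindlist(lapply(files, fread, select = common_cols, nThread = 4)))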

Merging only parts of csv files and adding a column with the csv file name

I wanted to merge csv files stored in the working directory and its subfolders.
This piece of code runs smoothly:
csv_files <- dir(pattern = '.*[.]csv', recursive = TRUE)
list.files()
my_data_frame <- do.call(rbind, lapply(csv_files, read.csv))
So far so good.
I now want to add a column containing the names of these csv files.
Furthermore, I want to extract only pieces of these csv files, let's say from the 5th row to the 10th one.
Thanks for your precious help!
You could simply replace read.csv in the lapply call with your own function that does the subsetting and adds the new column. E.g.,
csv_files <- dir(pattern = '.*[.]csv', recursive = TRUE)

# function to make a df from each csv
my_read_csv <- function(x) {
  dfx <- read.csv(x)[5:10, ]  # or any other subset
  dfx$fname <- basename(x)    # add new column with the file name
  return(dfx)
}

my_data_frame <- do.call(rbind, lapply(csv_files, my_read_csv))
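The same reader function also plugs into the purrr approach from the earlier answers, which avoids the do.call(rbind, ...) step:
# equivalent one-liner; bind_rows semantics match columns by name
my_data_frame <- purrr::map_df(csv_files, my_read_csv)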

R: Loop for importing multiple xls as df, rename column of one df and then merge all df's

The below is driving me a little crazy and I'm sure there's an easy solution.
I currently use R to perform some calculations from a bunch of excel files, where the files are monthly observations of financial data. The files all have the exact same column headers. Each file gets imported, gets some calcs done on it and the output is saved to a list. The next file is imported and the process is repeated. I use the following code for this:
library(xlsx)  # provides read.xlsx with this signature

filelist <- list.files(pattern = "\\.xls")
universe_list <- list()
count <- 1
for (file in filelist) {
  df <- read.xlsx(file, 1, startRow = 2, header = TRUE)
  # ... perform calcs ...
  universe_list[[count]] <- df
  count <- count + 1
}
I now have a problem where some of the new operations I want to perform would involve data from two or more excel files. So for example, I would need to import the Jan-16 and the Jan-15 excel files, perform whatever needs to be done, and then move on to the next set of files (Feb-16 and Feb-15). The files will always be a fixed period apart (like one year, etc.).
I can't seem to figure out the code for this. From a process perspective, I'm thinking I need to 1) design a loop to import both sets of files at the same time, 2) create two dataframes from the imported data, 3) rename the columns of one of the dataframes (so the columns can be distinguished), 4) merge both dataframes together, and 5) perform the calcs. I can't work out the code for steps 1-5!
Many thanks for helping out
Consider mapply() to handle both data frame pairs together. Your current loop is actually reminiscent of how other languages run for-loop operations; R has many vectorized approaches to iterate over lists. The below assumes both the 15 and 16 file lists are the same length with corresponding months in both, and that the year abbreviation comes right before the file extension (i.e., -15.xls, -16.xls):
files15list <- list.files(path, pattern = "-15\\.xls$")
files16list <- list.files(path, pattern = "-16\\.xls$")
dfprocess <- function(x, y) {
  df1 <- read.xlsx(x, 1, startRow = 2, header = TRUE)
  names(df1) <- paste0(names(df1), "1")  # SUFFIX COLS WITH 1
  df2 <- read.xlsx(y, 1, startRow = 2, header = TRUE)
  names(df2) <- paste0(names(df2), "2")  # SUFFIX COLS WITH 2
  df <- cbind(df1, df2)                  # CBIND DFs
  # ... perform calcs ...
  return(df)
}

wide_list <- mapply(dfprocess, files15list, files16list)
long_list <- lapply(1:ncol(wide_list),
                    function(i) wide_list[, i])  # ALTERNATE OUTPUT
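A side note on that last step: mapply simplifies its result by default, which is why the output needs reshaping into long_list. Passing SIMPLIFY = FALSE returns the list of data frames directly:
# one data frame per month pair, no reshaping needed
df_list <- mapply(dfprocess, files15list, files16list, SIMPLIFY = FALSE)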
First sort your filelist so that the two files you want to do your calculations on are next to each other. After that, try this:
for (count in seq(1, length(filelist), 2)) {
  df  <- read.xlsx(filelist[count], 1, startRow = 2, header = TRUE)
  df1 <- read.xlsx(filelist[count + 1], 1, startRow = 2, header = TRUE)
  # change column names and apply merge or append depending on requirement
  # perform calcs
  # save
}
