Merge data from nested dataframes in R - r

I have several DFs. Each of them is the results csv file of one participant from my experiment. Some of the csv files have 48 variables. Others have, in addition to these identical variables, 6 more variables (53 variables). However, if I try to merge them like this:
flist <- list.files(path = "my path", pattern = ".csv", full.names = TRUE)
Merge <- plyr::ldply(flist, read_csv)  # merge all files
the merging is done by the column order and not by the variable names. Therefore, in one column of my big combined DF I get data from different variables.
So I tried a different strategy: reading my files in as separate DFs:
data_files <- list.files("my_path")  # Identify file names
data_files
for (i in 1:length(data_files)) {  # Head of for-loop
  assign(paste0("data", i),        # Read and store data frames
         read_csv(paste0("my_path/", data_files[i])))
}
Then I tried to merge them with this script:
listDF <- names(which(unlist(eapply(.GlobalEnv, is.data.frame))))  # list of my DFs
listDF
library(plyr)
MergeDF <- do.call('rbind.fill', listDF)
But I'm still stuck.

We may use map_dfr
library(readr)
library(purrr)
map_dfr(setNames(flist, flist), read_csv, .id = "id")
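map_dfr binds rows by matching column names, so the files with 48 variables and those with the extra variables line up correctly, and the missing columns are filled with NA. As an aside, the rbind.fill attempt in the question fails because listDF is a character vector of object names, not a list of data frames; a minimal sketch of a fix (assuming the data frames live in the global environment, as in the question):
library(plyr)
# mget() looks the names up and returns the data frames themselves
MergeDF <- do.call(rbind.fill, mget(listDF, envir = .GlobalEnv))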

Related

Merging thousands of csv files into a single dataframe in R

I have 2500 csv files, all with the same columns and a variable number of observations.
Each file is approximately 3mb (~10000 obs per file).
Ideally, I would like to read all of these in to a single dataframe.
Each file represents a generation and contains info regarding traits, phenotypes and allele frequencies.
While reading in this data, I am also trying to add an extra column to each file indicating its generation.
I have written the following code:
read_data <- function(ex_files, ex) {
  df <- NULL
  ex <- as.character(ex)
  for (n in 1:length(ex_files)) {
    temp <- read.csv(paste("Experiment ", ex, "/all e", ex, " gen", as.character(n), ".csv", sep = ""))
    temp$generation <- n
    df <- rbind(df, temp)
  }
  return(df)
}
ex_files refers to the list of files (only its length is used in the loop), while ex refers to the experiment number, since the experiment was performed in replicate (i.e. I have multiple experiments, each with 2500 csv files).
I am currently running it (I hope it's written correctly!), but it is taking quite a while (as expected). I'm wondering if there is a quicker way of doing this at all?
It is inefficient to grow objects in a loop. List all the files that you want to read using list.files, then combine them into one dataframe with purrr::map_df; its .id argument adds a column called generation which gives a unique number to each file.
filenames <- list.files(pattern = '\\.csv', full.names = TRUE)
df <- purrr::map_df(filenames, read.csv, .id = 'generation')
head(df)
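One caveat: with an unnamed vector of file paths, .id records each file's position as a character column ("1", "2", ...), so you may want to convert it afterwards (a minimal sketch, continuing from the snippet above):
df$generation <- as.integer(df$generation)  # positions come back as character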
Try the plyr package:
filenames <- list.files(pattern = '\\.csv', full.names = TRUE)
df <- plyr::ldply(filenames, read.csv)

A Function to Merge 100 Dataframes into One Dataframe

I am new to programming and R is my first programming language to learn.
I want to merge 100 dataframes; each dataframe contains one column and 20 observations, as shown below:
df1 <- as.data.frame(c(6,3,4,4,5,...))
df2 <- as.data.frame(c(2,2,3,5,10,...))
df3 <- as.data.frame(c(5,9,2,3,7,...))
...
df100 <- as.data.frame(c(4,10,5,9,8,...))
I tried using df.list <- list(df1:df100) to construct an overall dataframe from all of the dataframes, but I am not sure whether df.list merges all the columns from all the dataframes together into a table.
Can anyone tell me if I am right? And what do I need to do?
We can use mget to get all the objects into a list, by specifying a pattern in ls that matches object names that start (^) with 'df', followed by one or more digits (\\d+), up to the end ($) of the string:
df.list <- mget(ls(pattern = '^df\\d+$'))
From the list, if we want to cbind all the datasets, use cbind in do.call:
out <- do.call(cbind, df.list)
NOTE: It is better not to create multiple objects in the global environment. We could have read all the data into a list directly, or constructed it within a list. I.e., if the files are read from .csv, get all the .csv files from the directory of interest with list.files, then loop over the files with lapply, read them individually with read.csv, and cbind:
files <- list.files(path = 'path/to/your/location',
                    pattern = '\\.csv$', full.names = TRUE)
out <- do.call(cbind, lapply(files, read.csv))
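If you also want the resulting columns tagged with the file they came from, naming the list first makes cbind prefix the column names (a sketch, reusing the files vector above):
dfs <- setNames(lapply(files, read.csv), basename(files))
out <- do.call(cbind, dfs)  # columns become e.g. "file.csv.colname"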
We can also use the reduce function from the purrr package, after creating a character vector of the data frame names:
library(dplyr)
library(purrr)
names <- paste0("df", 1:100)
names[-1] %>%   # names[1] already seeds the reduction via .init
  reduce(.init = get(names[1]), ~ bind_rows(..1, get(..2)))
Or in base R:
Reduce(function(x, y) rbind(x, get(y)), names[-1], init = get(names[1]))
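Note that both reduce calls grow the result one data frame at a time; since the names are already known, a single bind over the whole list does the same job in one step (a minimal sketch):
out <- dplyr::bind_rows(mget(names))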

R - Combine multiple data frames according to the pattern in their name

I would like to combine data frames in the global environment according to the pattern in their name, and simultaneously add the name of the file they originally came from.
My problem is that I originally have a zip file with over 20 text files in the main folder and sub-folders, which mainly follow two different scenarios: "test" and "train". Hence, I decided to first read ALL of the txt files into R, create two lists of df names matching either the "test" or the "train" pattern, and use those lists to merge the dataframes into two main dataframes. Now I need to combine those dataframes according to the names in the list, but rbind just creates another list of their names. How do I make rbind treat its inputs as the objects named in the list, not as strings?
Moreover, rbind would combine the dfs without an opportunity to add their names as a variable. Maybe there is a solution that lets me simultaneously combine the dfs and add each df name as a column variable?
What I did so far:
#loading the necessary libraries
library(dplyr)
library(readr)
library(easycsv)
#setting url and directory of the data file
url <- "https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip"
destination <- "accelerometer_data.zip"
#downloading the file and storing it into computer memory
download.file(url, destfile = destination)
#read all txt files into R
test_folder <- easycsv::fread_zip(file = destination,
                                  extension = "TXT")
#create a list of "test" data frames
list_test <- as.list(
  do.call(cbind, ls(
    grep(pattern = "^UCI+(.*)test",
         x = ls(),
         value = TRUE)
  ))
)
#bind dfs as named in list_test
test_df <- lapply(list_test, FUN = function(x) {
  rbind(
    eval(
      parse(text = x)
    )
  )
})
You can use mget to get all the data matching a specific pattern into a list, then use dplyr::bind_rows to combine them into one dataframe; its .id parameter includes the object name as a separate column.
library(dplyr)
test_data <- bind_rows(mget(grep(pattern = "^UCI+(.*)test", x = ls(),
                                 value = TRUE)), .id = 'filename')
train_data <- bind_rows(mget(grep(pattern = "^UCI+(.*)train", x = ls(),
                                  value = TRUE)), .id = 'filename')
However, the 'test' and 'train' files have dataframes with different numbers of columns, hence certain columns contain only NAs for some files. Maybe you need to update the pattern and make it stricter?
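For example, anchoring the end of the pattern keeps out objects whose names merely contain "test" somewhere in the middle (a sketch; adjust to your actual object names):
test_data <- bind_rows(mget(grep("^UCI.*test$", ls(), value = TRUE)),
                       .id = 'filename')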

R: Loop for importing multiple xls as df, rename column of one df and then merge all df's

The below is driving me a little crazy and I'm sure there's an easy solution.
I currently use R to perform some calculations from a bunch of excel files, where the files are monthly observations of financial data. The files all have the exact same column headers. Each file gets imported, gets some calcs done on it and the output is saved to a list. The next file is imported and the process is repeated. I use the following code for this:
filelist <- list.files(pattern = "\\.xls")
universe_list <- list()
count <- 1
for (file in filelist) {
  df <- read.xlsx(file, 1, startRow = 2, header = TRUE)
  # *perform calcs*
  universe_list[[count]] <- df
  count <- count + 1
}
I now have a problem where some of the new operations I want to perform would involve data from two or more excel files. So, for example, I would need to import the Jan-16 and the Jan-15 excel files, perform whatever needs to be done, and then move on to the next set of files (Feb-16 and Feb-15). The files will always be a fixed interval apart (like one year, etc.).
I can't seem to figure out the code for how to do this. From a process perspective, I'm thinking I need to 1) design a loop to import both sets of files at the same time, 2) create two dataframes from the imported data, 3) rename the columns of one of the dataframes (so the columns can be distinguished), 4) merge both dataframes together, and 5) perform the calcs. I can't work out the code for steps 1-5!
Many thanks for helping out
Consider mapply() to handle both data frame pairs together. Your current loop is actually reminiscent of for-loop operations in other languages; R, however, has many vectorized approaches to iterate over lists. Below assumes both the 15 and 16 year lists of files are the same length, with corresponding months in both, and that the year abbreviation comes right before the file extension (i.e., -15.xls, -16.xls):
files15list <- list.files(path, pattern = "15\\.xls$")
files16list <- list.files(path, pattern = "16\\.xls$")
dfprocess <- function(x, y) {
  df1 <- read.xlsx(x, 1, startRow = 2, header = TRUE)
  names(df1) <- paste0(names(df1), "1")   # SUFFIX COLS WITH 1
  df2 <- read.xlsx(y, 1, startRow = 2, header = TRUE)
  names(df2) <- paste0(names(df2), "2")   # SUFFIX COLS WITH 2
  df <- cbind(df1, df2)                   # CBIND DFs
  # ... perform calcs ...
  return(df)
}
wide_list <- mapply(dfprocess, files15list, files16list)
long_list <- lapply(1:ncol(wide_list),
                    function(i) wide_list[, i])  # ALTERNATE OUTPUT
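If you would rather skip the matrix-to-list conversion, mapply can be told not to simplify its result, or Map (which never simplifies) can be used instead; a minimal sketch:
# a plain list of data frames, one per file pair
long_list <- mapply(dfprocess, files15list, files16list, SIMPLIFY = FALSE)
# equivalently:
long_list <- Map(dfprocess, files15list, files16list)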
First sort your filelist so that the two files on which you want to do your calculations are consecutive to each other. After that try this:
for (count in seq(1, length(filelist), 2)) {
  df <- read.xlsx(filelist[count], 1, startRow = 2, header = TRUE)
  df1 <- read.xlsx(filelist[count + 1], 1, startRow = 2, header = TRUE)
  # change column names and apply merge or append depending on requirement
  # perform calcs
  # save
}

Using for loops to match pairs of data frames in R

Using a particular function, I wish to merge pairs of data frames, for multiple pairings in an R directory. I am trying to write a 'for loop' that will do this job for me, and while related questions such as Merge several data.frames into one data.frame with a loop are helpful, I am struggling to adapt the example loops for this particular use.
My data frames end with either "_df1.csv" or "_df2.csv". Each pair that I wish to merge into an output data frame has an identical number at the beginning of the file name (i.e. 543_df1.csv and 543_df2.csv).
I have created a character vector for each of the two types of file in my directory using the list.files command, as below:
df1files <- list.files(path="~/Desktop/combined files", pattern="_df1\\.csv$", full.names=TRUE, recursive=FALSE)
df2files <- list.files(path="~/Desktop/combined files", pattern="_df2\\.csv$", full.names=TRUE, recursive=FALSE)
The function and commands that I want to apply in order to merge each pair of data frames are as follows:
findRow <- function(dt, df) { min(which(df$datetime > dt )) }
rows <- sapply(df2$datetime, findRow, df=df1)
merged <- cbind(df2, df1[rows,])
I am now trying to incorporate these commands into a for loop, starting with something along the following lines, to save me from having to merge the pairs manually:
for(i in 1:length(df2files)){ ……
I am not yet a strong R programmer, and have hit a wall, so any help would be greatly appreciated.
My intuition (which I haven't had a chance to check) is that you should be able to do something like the following:
# read in the data as two lists of dataframes:
dfs1 <- lapply(df1files, read.csv)
dfs2 <- lapply(df2files, read.csv)
# define your merge commands as a function
merge2 <- function(df1, df2){
  findRow <- function(dt, df) { min(which(df$datetime > dt)) }
  rows <- sapply(df2$datetime, findRow, df=df1)
  merged <- cbind(df2, df1[rows,])
}
# apply that merge command to the list of lists
mergeddfs <- mapply(merge2, dfs1, dfs2, SIMPLIFY=FALSE)
# write results to files
outfilenames <- gsub("df1","merged",df1files)
mapply(function(x,y) write.csv(x,y), mergeddfs, outfilenames)
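One caveat: this pairing assumes list.files returns both vectors in the same order. If that is not guaranteed, the pairs can be aligned explicitly by their numeric prefix before merging (a hedged sketch, assuming file names like 543_df1.csv):
ids1 <- sub("_df1\\.csv$", "", basename(df1files))
ids2 <- sub("_df2\\.csv$", "", basename(df2files))
df2files <- df2files[match(ids1, ids2)]  # reorder so each pair lines up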
