A Function to Merge 100 Dataframes to one Dataframe - r

I am new to programming and R is my first programming language to learn.
I want to merge 100 dataframes; each dataframe contains one column and 20 observations, as shown below:
df1 <- as.data.frame(c(6,3,4,4,5,...))
df2 <- as.data.frame(c(2,2,3,5,10,...))
df3 <- as.data.frame(c(5,9,2,3,7,...))
...
df100 <- as.data.frame(c(4,10,5,9,8,...))
I tried using df.list <- list(df1:df100) to combine all of the dataframes into one overall dataframe, but I am not sure whether df.list actually merges the columns from all the dataframes into a single table.
Can anyone tell me if I am right? And what do I need to do?

We can use mget to collect all the objects into a list. The pattern passed to ls matches object names that start (^) with 'df', followed by one or more digits (\\d+), up to the end ($) of the string:
df.list <- mget(ls(pattern = '^df\\d+$'))
From the list, if we want to cbind all the datasets, use cbind in do.call:
out <- do.call(cbind, df.list)
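To see what these two steps produce, here is a small self-contained sketch with three dataframes instead of 100. The column name value and the numbers are made up for illustration.

```r
# Three small stand-ins for df1 ... df100 (values are arbitrary)
df1 <- data.frame(value = c(6, 3, 4))
df2 <- data.frame(value = c(2, 2, 3))
df3 <- data.frame(value = c(5, 9, 2))

# Collect every object whose name is "df" followed only by digits
df.list <- mget(ls(pattern = "^df\\d+$"))

# Bind the one-column dataframes side by side
out <- do.call(cbind, df.list)

# Name each column after the dataframe it came from
names(out) <- names(df.list)
dim(out)
```

Note that ls returns names in alphabetical order, so with 100 objects df10 would come before df2. If the column order matters, mget(paste0("df", 1:100)) keeps the numeric order.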
NOTE: It is better not to create multiple objects in the global environment. We could have read all the data into a list directly, or constructed the dataframes within a list. For example, if the data come from .csv files, get all the .csv files in the directory of interest with list.files, loop over them with lapply, read each one with read.csv, and cbind the results:
files <- list.files(path = 'path/to/your/location',
pattern = '\\.csv$', full.names = TRUE)
out <- do.call(cbind, lapply(files, read.csv))
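The same pattern can be tried end to end without any existing files. This sketch writes two throwaway CSV files to a temporary directory first; the directory name, file names and column names are assumptions made only for the demonstration.

```r
# Write two small CSV files to a temporary directory (stand-ins for real data)
csv_dir <- file.path(tempdir(), "merge_demo")
dir.create(csv_dir, showWarnings = FALSE)
write.csv(data.frame(a = 1:3), file.path(csv_dir, "one.csv"), row.names = FALSE)
write.csv(data.frame(b = 4:6), file.path(csv_dir, "two.csv"), row.names = FALSE)

# Full paths of every file ending in .csv
files <- list.files(path = csv_dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file into a list, then bind the columns in one call
csv_list <- lapply(files, read.csv)
out <- do.call(cbind, csv_list)
```

Keeping the intermediate list (csv_list here) is often handy, because the same list can later be passed to rbind or any other bulk operation.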

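We can also use the reduce function from the purrr package, after creating a character vector with the names of the dataframes. Note that bind_rows stacks the dataframes vertically, and that the reduction runs over the remaining names so that the first dataframe, used as .init, is not included twice:
library(dplyr)
library(purrr)
df_names <- paste0("df", 1:100)
df_names[-1] %>%
reduce(.init = get(df_names[1]), ~ bind_rows(.x, get(.y)))
Or in base R:
Reduce(function(x, y) rbind(x, get(y)), df_names[-1], init = get(df_names[1]))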

Related

merge data from nested dataframes in R

I have several DFs. Each of them is the results csv file of one participant from my experiment. Some of the csv files have 48 variables. Others have, in addition to these identical variables, 6 more variables (53 variables). However, if I try to merge them like this:
flist <- list.files(path="my path", pattern = ".csv", full.names = TRUE)
Merge<-plyr::ldply(flist, read_csv) #Merge all files
the merging is done by column order and not by variable name. Therefore, a single column in my big combined DF ends up holding data from different variables.
So I tried a different strategy: loading my files as separate DFs:
data_files <- list.files("my_path") # Identify file names
data_files
for(i in 1:length(data_files)) { # Head of for-loop
assign(paste0("data", i), # Read and store data frames
read_csv(paste0("my_path/",
data_files[i])))
}
Then I tried to merge them by this script:
listDF <- names(which(unlist(eapply(.GlobalEnv,is.data.frame)))) #list of my DFs
listDF
library(plyr)
MergeDF<-do.call('rbind.fill', listDF)
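But I'm still stuck.
We may use map_dfr from purrr, which reads each file and row-binds the results with dplyr::bind_rows, so columns are matched by name and variables missing from a file are filled with NA: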
library(readr)
library(purrr)
map_dfr(setNames(flist, flist), read_csv, .id = "id")
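For readers who prefer to avoid extra packages, matching by column name can also be sketched in base R. This is an illustration under assumptions, not the answer's method: the two participant dataframes are invented, and align_cols is a hypothetical helper, not a function from any package.

```r
# Two participants; the second has an extra variable (made-up data)
p1 <- data.frame(id = 1:2, rt = c(0.51, 0.62))
p2 <- data.frame(id = 3:4, rt = c(0.48, 0.55), extra = c("x", "y"))
dfs <- list(p1 = p1, p2 = p2)

# Union of all column names across the dataframes
all_cols <- unique(unlist(lapply(dfs, names)))

# Hypothetical helper: add each missing column as NA, then fix the column order
align_cols <- function(d, cols) {
  for (col in setdiff(cols, names(d))) d[[col]] <- NA
  d[cols]
}

# Every dataframe now has the same columns in the same order, so rbind is safe
merged <- do.call(rbind, lapply(dfs, align_cols, cols = all_cols))
```

With real files, dfs would be the result of lapply(flist, read.csv) rather than hand-built dataframes.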

do.call skip error and continue processing

After a for loop I have 4 dataframes (data1, data2, data3, data4), and I want to rbind all of them.
I tried:
do.call(rbind, mget(paste0("data", 1:4)))
but sometimes the for loop gives me only 3 of them, for example: data1, data2, data4.
In that case mget fails because one of the requested objects does not exist.
How can I still get an rbind of data1, data2, data4?
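You can get all your objects from the global environment (via ls()) and use grep to keep the ones that follow the pattern you need. Anchoring the pattern with ^ and $ avoids picking up unrelated objects whose names merely contain it, i.e.
do.call(rbind, mget(grep('^data[0-9]+$', ls(), value = TRUE)))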
Maybe check if dataframe exists in the environment and mget only those.
data_names <- paste0("data", 1:4)
do.call(rbind, mget(data_names[sapply(data_names, exists)]))
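A minimal sketch of this check, simulating a loop that skipped data3. The single column x and its values are assumptions for the demonstration; env simply captures the current environment so that the lookup and the retrieval happen in the same place.

```r
# Simulate a loop that produced data1, data2 and data4 but not data3
data1 <- data.frame(x = 1)
data2 <- data.frame(x = 2)
data4 <- data.frame(x = 4)

data_names <- paste0("data", 1:4)
env <- environment()

# Keep only the names that currently exist as objects in this environment
is_present <- vapply(data_names, exists, logical(1), envir = env)
present <- data_names[is_present]

# Fetch the existing dataframes and stack them
combined <- do.call(rbind, mget(present, envir = env))
combined$x
```

By default exists also searches enclosing environments and attached packages, so a name such as "data" would be reported as existing because of the base function. Passing inherits = FALSE restricts the check to the given environment.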
You can use the pattern argument of ls to identify your objects: mget takes a character vector of object names, and pattern accepts a regular expression, which is more flexible than generating object names via paste.
data_cars_one <- mtcars
data_cars_two <- mtcars
library(tidyverse)
res_all <- bind_rows(mget(x = ls(pattern = "^data")))
Concerning the binding, I've used bind_rows just as an alternative to do.call and Reduce solutions.

Saving data frames to values in a list

I have a list of titles that I would like to iterate over, creating and saving a data frame for each one. I have tried using the paste() function (as seen below), but that does not work for me. Any advice would be greatly appreciated.
samples <- list("A","B","C")
for (i in samples){
paste(i,sumT,sep="_") <- data.frame(col1=NA,col1=NA)
}
My desired output is three empty data frames named: A_sumT, B_sumT and C_sumT
Here's an answer with purrr. It returns a named list of empty data frames rather than three separate objects:
library(purrr)
samples <- list("A", "B", "C")
samples %>%
map(~ data.frame()) %>%
set_names(paste(samples, "sumT", sep="_"))
Consider creating a list of dataframes instead of many separate objects flooding the global environment, since this example could extend to hundreds of dataframes and not just three. With this approach you also maintain one container on which you can run bulk operations across all dataframes.
By using sapply below on a character vector, you create a named list:
samples <- c("A","B","C") # OR unlist(list("A","B","C"))
df_list <- sapply(samples, function(x) data.frame(col1=NA,col2=NA), simplify=FALSE)
# RUN ANY DATAFRAME OPERATION
head(df_list$A)
tail(df_list$B)
summary(df_list$C)
# BULK OPERATIONS
stacked_df <- do.call(rbind, df_list)
wide_df <- do.call(cbind, df_list)
merged_df <- Reduce(function(x,y) merge(x,y,by="col1"), df_list)
Or, if you need to rename the list elements:
# RENAME LIST
df_list <- setNames(df_list, paste0(samples, "_sumT"))
# RUN ANY DATAFRAME OPERATION
head(df_list$A_sumT)
tail(df_list$B_sumT)
summary(df_list$C_sumT)
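If separate variables are still needed after all, the named list can be unpacked with list2env. The sketch below is one possible way to do that; sample_env is a name invented here, and a dedicated environment is used so that the demonstration does not touch the workspace.

```r
samples <- c("A", "B", "C")

# Named list of placeholder dataframes, one per sample
df_list <- sapply(samples, function(x) data.frame(col1 = NA, col2 = NA),
                  simplify = FALSE)
df_list <- setNames(df_list, paste0(samples, "_sumT"))

# Copy each list element into its own variable in a dedicated environment
# (use envir = globalenv() to create them in the workspace instead)
sample_env <- new.env()
invisible(list2env(df_list, envir = sample_env))

# The three dataframes are now available as A_sumT, B_sumT and C_sumT
sort(ls(sample_env))
```

Working from the list remains the more convenient option for bulk operations; unpacking is mainly useful when other code expects the individual names.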

import multiple files and extract specific column in r

I have 20 data files (.txt). My end goal is to choose a specific column (let's say V3) from each of the 20 files, and make a new file.
I tried
temp <- list.files(pattern='*.snp.blp')
How can I extract V3 from each of the 20 files and combine (cbind) them in R?
We can use fread from data.table, which has a select argument to read only the specific columns we need instead of reading the whole file:
library(data.table)
library(purrr)
library(dplyr)
map(temp, fread, select = 'V3') %>%
bind_cols
If the number of rows is not the same across files, then use cbind.fill from the rowr package:
out <- map(temp, fread, select = 'V3')
do.call(rowr::cbind.fill, c(out, fill = NA))
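If rowr is not available in your setup (it may no longer be installable from CRAN), the same padding can be sketched in base R. The vectors below are made-up stand-ins for the V3 column of each file, and the names file1 to file3 are placeholders.

```r
# Columns of different lengths, standing in for V3 from each file
cols <- list(file1 = c(1, 2, 3), file2 = c(4, 5), file3 = c(6, 7, 8, 9))

# Pad every vector with NA up to the longest length
n_max <- max(lengths(cols))
padded <- lapply(cols, function(v) c(v, rep(NA, n_max - length(v))))

# All vectors now have equal length, so they can become columns of one dataframe
out <- as.data.frame(padded)
```

With real files, cols could be built with something like lapply(temp, function(f) fread(f, select = "V3")[[1]]), which extracts each V3 column as a plain vector.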
Data used for the example:
set.seed(24)
invisible(map(paste0('snp.blp', 1:3, '.csv'), ~
matrix(sample(1:10, 10 * 3, replace = TRUE), ncol = 3,
dimnames = list(NULL, paste0("V", 1:3))) %>%
as_tibble %>%
readr::write_csv(., file = .x)))
temp <- list.files(pattern='snp.blp')
Arguably it's better to rbind() the rows of the same variable across multiple files than cbind() them, especially since cbind() fails when the files have different numbers of rows.
In the situation where we need to combine only a single column from multiple files, we can also use unlist() instead of rbind().
A complete, working example combining rows using base R can be accomplished with lapply(), an anonymous function, and unlist(). We'll use data from Alex Barradas' Pokémon Stats database from kaggle.com, where I've restructured the data into 6 CSV files, one for each of the first six generations of Pokémon.
download.file("https://raw.githubusercontent.com/lgreski/pokemonData/master/pokemonData.zip",
"pokemonData.zip",
mode="wb") # binary mode; the original also set method="wininet", which is Windows-only
unzip("pokemonData.zip")
thePokemonFiles <- list.files("./pokemonData",
full.names=TRUE)
attackStats <- lapply(thePokemonFiles,function(x) {
# read data and subset to Attack stat using the extract operator [
read.csv(x)["Attack"]
})
# unlist to combine into a vector
attackStats <- unlist(attackStats)
# use the data in another R function
hist(attackStats)
...and the output is a histogram of attackStats (figure not shown).

Using for loops to match pairs of data frames in R

Using a particular function, I wish to merge pairs of data frames, for multiple pairings in an R directory. I am trying to write a ‘for loop’ that will do this job for me, and while related questions such as Merge several data.frames into one data.frame with a loop are helpful, I am struggling to adapt example loops for this particular use.
My data frames end with either "_df1.csv" or "_df2.csv". Each pair that I wish to merge into an output data frame has an identical number at the beginning of the file name (i.e. 543_df1.csv and 543_df2.csv).
I have created a character vector for each of the two types of file in my directory using the list.files command as below:
df1files <- list.files(path="~/Desktop/combined files", pattern="_df1\\.csv$", full.names=TRUE, recursive=FALSE)
df2files <- list.files(path="~/Desktop/combined files", pattern="_df2\\.csv$", full.names=TRUE, recursive=FALSE)
The function and commands that I want to apply in order to merge each pair of data frames are as follows:
findRow <- function(dt, df) { min(which(df$datetime > dt )) }
rows <- sapply(df2$datetime, findRow, df=df1)
merged <- cbind(df2, df1[rows,])
I am now trying to incorporate these commands into a for loop starting with something along the following lines, to prevent me from having to manually merge the pairs:
for(i in 1:length(df2files)){ ……
I am not yet a strong R programmer, and have hit a wall, so any help would be greatly appreciated.
My intuition (which I haven't had a chance to check) is that you should be able to do something like the following:
# read in the data as two lists of dataframes:
dfs1 <- lapply(df1files, read.csv)
dfs2 <- lapply(df2files, read.csv)
# define your merge commands as a function
merge2 <- function(df1, df2){
# for each datetime in df2, find the first row of df1 with a later datetime
findRow <- function(dt, df) { min(which(df$datetime > dt )) }
rows <- sapply(df2$datetime, findRow, df=df1)
# return the combined data frame
cbind(df2, df1[rows,])
}
# apply that merge function pairwise to the two lists of dataframes
mergeddfs <- mapply(merge2, dfs1, dfs2, SIMPLIFY=FALSE)
# write results to files
outfilenames <- gsub("df1","merged",df1files)
invisible(mapply(function(x,y) write.csv(x,y), mergeddfs, outfilenames))
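The mapply call pairs the two lists by position, so it relies on df1files and df2files being in the same order and of the same length. One way to make that explicit is to match the files on the number at the start of the file name. The file names below are hypothetical; in practice they would come from list.files.

```r
# Hypothetical file names; in practice these come from list.files()
df1files <- c("dir/543_df1.csv", "dir/12_df1.csv", "dir/87_df1.csv")
df2files <- c("dir/87_df2.csv", "dir/543_df2.csv", "dir/12_df2.csv")

# Extract the identifier that precedes "_df1.csv" / "_df2.csv"
id1 <- sub("_df1\\.csv$", "", basename(df1files))
id2 <- sub("_df2\\.csv$", "", basename(df2files))

# Reorder the second vector so that element i pairs with element i of the first
df2files <- df2files[match(id1, id2)]

# Stop early if any df1 file lacks a df2 partner
stopifnot(!anyNA(df2files))
```

Running this before the two lapply(..., read.csv) calls means each merged result really combines the two files that share an identifier.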
