Using a particular function, I wish to merge pairs of data frames, for multiple pairings in an R directory. I am trying to write a ‘for loop’ that will do this job for me, and while related questions such as Merge several data.frames into one data.frame with a loop are helpful, I am struggling to adapt example loops for this particular use.
My data frames end with either “_df1.csv” or ‘_df2.csv”. Each pair, that I wish to merge into an output data frame, has an identical number at the being of the file name (i.e. 543_df1.csv and 543_df2.csv).
I have created a character string for each of the two types of file in my directory using the list.files command as below:
df1files <- list.files(path="~/Desktop/combined files” pattern="*_df1.csv", full.names=T, recursive=FALSE)
df2files <- list.files(path="="~/Desktop/combined files ", pattern="*_df2.csv", full.names=T, recursive=FALSE)
The function and commands that I want to apply in order to merge each pair of data frames are as follows:
findRow <- function(dt, df) { min(which(df$datetime > dt )) }
rows <- sapply(df2$datetime, findRow, df=df1)
merged <- cbind(df2, df1[rows,])
I am now trying to incorporate these commands into a for loop starting with something along the following lines, to prevent me from having to manually merge the pairs:
for(i in 1:length(df2files)){ ……
I am not yet a strong R programmer, and have hit a wall, so any help would be greatly appreciated.
My intuition (which I haven't had a chance to check) is that you should be able to do something like the following:
# read in the data as two lists of dataframes:
dfs1 <- lapply(df1files, read.csv)
dfs2 <- lapply(df2files, read.csv)
# define your merge commands as a function
merge2 <- function(df1, df2){
findRow <- function(dt, df) { min(which(df$datetime > dt )) }
rows <- sapply(df2$datetime, findRow, df=df1)
merged <- cbind(df2, df1[rows,])
}
# apply that merge command to the list of lists
mergeddfs <- mapply(merge2, dfs1, dfs2, SIMPLIFY=FALSE)
# write results to files
outfilenames <- gsub("df1","merged",df1files)
mapply(function(x,y) write.csv(x,y), mergeddfs, outfilenames)
Related
I have several DFs. Each of them is res csv file of one participant form my exp. Some of the csv have 48 variables. Others have in addition to these identical variables 6 more variable (53 variables). However, If I try to merge them like this:
flist <- list.files(path="my path", pattern = ".csv", full.names = TRUE)
Merge<-plyr::ldply(flist, read_csv) #Merge all files
the merging is done by the columns orders and not by the variable name. Therefore in one column in my big combine DF I get data form different variables.
So I tried different strategy: uploading my files as separate DFs:
data_files <- list.files("my_path") # Identify file names
data_files
for(i in 1:length(data_files)) { # Head of for-loop
assign(paste0("data", i), # Read and store data frames
read_csv(paste0("my_path/",
data_files[i])))
}
Then I tried to merge them by this script:
listDF <- names(which(unlist(eapply(.GlobalEnv,is.data.frame)))) #list of my DFs
listDF
library(plyr)
MergeDF<-do.call('rbind.fill', listDF)
But I'm still stuck.
We may use map_dfr
library(readr)
library(purrr)
map_dfr(setNames(flist, flist), read_csv, .id = "id")
I have a list of titles that I would like to iterate over and create/save data frames to. I have tried the using the paste() function (as seen below) but that does not work for me. Any advice would be greatly appreciated.
samples <- list("A","B","C")
for (i in samples){
paste(i,sumT,sep="_") <- data.frame(col1=NA,col1=NA)
}
My desired output is three empty data frames named: A_sumT, B_sumT and C_sumT
Here's an answer with purrr.
samples <- list("A", "B", "C")
samples %>%
purrr::map(~ data.frame()) %>%
purrr::set_names(~ paste(samples, "sumT", sep="_"))
Consider creating a list of dataframes and avoid many separate objects flooding global environment as this example can extend to hundreds and not just three. Plus with this approach, you will maintain one container capable of running bulk operations across all dataframes.
By using sapply below on a character vector, you create a named list:
samples <- c("A","B","C") # OR unlist(list("A","B","C"))
df_list <- sapply(samples, function(x) data.frame(col1=NA,col2=NA), simplify=FALSE)
# RUN ANY DATAFRAME OPERATION
head(df_list$A)
tail(df_list$B)
summary(df_list$C)
# BULK OPERATIONS
stacked_df <- do.call(rbind, df_list)
stacked_df <- do.call(cbind, df_list)
merged_df <- Reduce(function(x,y) merge(x,y,by="col1"), df_list)
Or if you need to rename list
# RENAME LIST
df_list <- setNames(df_list, paste0(samples, "_sumT"))
# RUN ANY DATAFRAME OPERATION
head(df_list$A_sumT)
tail(df_list$B_sumT)
summary(df_list$C_sumT)
The below is driving me a little crazy and I’m sure theres an easy solution.
I currently use R to perform some calculations from a bunch of excel files, where the files are monthly observations of financial data. The files all have the exact same column headers. Each file gets imported, gets some calcs done on it and the output is saved to a list. The next file is imported and the process is repeated. I use the following code for this:
filelist <- list.files(pattern = "\\.xls")
universe_list <- list()
count <- 1
for (file in filelist) {
df <- read.xlsx(file, 1, startRow=2, header=TRUE)
*perform calcs*
universe_list[[count]] <- df
count <- count + 1
}
I now have a problem where some of the new operations I want to perform would involve data from two or more excel files. So for example, I would need to import the Jan-16 and the Jan-15 excel files, perform whatever needs to be done, and then move on to the next set of files (Feb-16 and Feb-15). The files will always be of fixed length apart (like one year etc)
I cant seem to figure out the code on how to do this… from a process perspective, Im thinking 1) need to design a loop to import both sets of files at the same time, 2) create two dataframes from the imported data, 3) rename the columns of one of the dataframes (so the columns can be distinguished), 4) merge both dataframes together, and 4) perform the calcs. I cant work out the code for steps 1-4 for this!
Many thanks for helping out
Consider mapply() to handle both data frame pairs together. Your current loop is actually reminiscient of other languages running for loop operations. However, R has many vectorized approaches to iterate over lists. Below assumes both 15 and 16 year list of files are same length with corresponding months in both and year abbrev comes right before file extension (i.e, -15.xls, -16.xls):
files15list <- list.files(path, pattern = "[15]\\.xls")
files16list <- list.files(path, pattern = "[16]\\.xls")
dfprocess <- function(x, y){
df1 <- read.xlsx(x, 1, startRow=2, header=TRUE)
names(df1) <- paste0(names(df1), "1") # SUFFIX COLS WITH 1
df2 <- read.xlsx(y, 1, startRow=2, header=TRUE)
names(df2) <- paste0(names(df2), "2") # SUFFIX COLS WITH 2
df <- cbind(df1, df2) # CBIND DFs
# ... perform calcs ...
return(df)
}
wide_list <- mapply(dfprocess, files15list, files16list)
long_list <- lapply(1:ncol(wide_list),
function(i) wide_list[,i]) # ALTERNATE OUTPUT
First sort your filelist such that the two files on which you want to do your calculations are consecutive to each other. After that try this:
count <- 1
for (count in seq(1, (len(filelist)),2) {
df <- read.xlsx(filelist[count], 1, startRow=2, header=TRUE)
df1 <- read.xlsx(filelist[count+1], 1, startRow=2, header=TRUE)
*change column names and apply merge or append depending on requirement
*perform calcs*
*save*
}
I need to download 300+ .csv files available online and combine them into a dataframe in R. They all have the same column names but vary in length (number of rows).
l<-c(1441,1447,1577)
s1<-"https://coraltraits.org/species/"
s2<-".csv"
for (i in l){
n<-paste(s1,i,s2, sep="") #creates download url for i
x <- read.csv( curl(n) ) #reads download url for i
#need to sucessively combine each of the 3 dataframes into one
}
Like #RohitDas said, continuously appending a data frame is very inefficient and will be slow. Just download each of the csv files as an entry in a list, and then bind all the rows after collecting all the data in the list.
l <- c(1441,1447,1577)
s1 <- "https://coraltraits.org/species/"
s2 <- ".csv"
# Initialize a list
x <- list()
# Loop through l and download the table as an element in the list
for(i in l) {
n <- paste(s1, i, s2, sep = "") # Creates download url for i
# Download the table as the i'th entry in the list, x
x[[i]] <- read.csv( curl(n) ) # reads download url for i
}
# Combine the list of data frames into one data frame
x <- do.call("rbind", x)
Just a warning: all the data frames in x must have the same columns to do this. If one of the entries in x has a different number of columns, or differently named columns, the rbind will fail.
More efficient row binding functions (with some extras, such as column filling) exist in several different packages. Take a look at some of these solutions for binding rows:
plyr::rbind.fill()
dplyr::bind_rows()
data.table::rbindlist()
If they have the same columns then its just a matter of appending the rows. A simple (but not memory efficient) approach is using rbind in a loop
l<-c(1441,1447,1577)
s1<-"https://coraltraits.org/species/"
s2<-".csv"
data <- NULL
for (i in l){
n<-paste(s1,i,s2, sep="") #creates download url for i
x <- read.csv( curl(n) ) #reads download url for i
#need to sucessively combine each of the 3 dataframes into one
data <- rbind(data,x)
}
A more efficient way would be to build a list and then combine them into a single data frame at the end, but I will leave that as an exercise for you.
I am using the ExtremeBounds package which provides as a result a multi level list with (amongst others) dataframes at the lowest level. I run this package over several specifications and I would like to collect some columns of selected dataframes in these results. These should be collected by specification (spec1 and spec2 in the example below) and arranged in a list of dataframes. This list of dataframes can then be used for all kind of things, for example to export the results of different specifications into different Excel Sheets.
Here is some code which creates the problematic object (just run this code blindly, my problem only concerns how to deal with the kind of list it creates: eba_results):
library("ExtremeBounds")
Data <- data.frame(var1=rbinom(30,1,0.2),var2=rbinom(30,2,0.2),
var3=rnorm(30),var4=rnorm(30),var5=rnorm(30))
spec1 <- list(y=c("var1"),
freevars=c("var2"),
doubtvars=c("var3","var4"))
spec2 <- list(y=c("var1"),
freevars=c("var2"),
doubtvars=c("var3","var4","var5"))
indicators <- c("spec1","spec2")
ebaFun <- function(x){
eba <- eba(data=Data, y=x$y,
free=x$freevars,
doubtful=x$doubtvars,
reg.fun=glm, k=1, vif=7, draws=50, weights = "lri", family = binomial(logit))}
eba_results <- lapply(mget(indicators),ebaFun) #eba_results is the object in question
Manually I know how to access each element, for example:
eba_results$spec1$bounds$type #look at str(eba_results) to see the different levels
So "bounds" is a dataframe with identical column names for both spec1 and spec2. I would like to collect the following 5 columns from "bounds":
type, cdf.mu.normal, cdf.above.mu.normal, cdf.mu.generic, cdf.above.mu.generic
into one dataframe per spec. Manually this is simple but ugly:
collectedManually <-list(
manual_spec1 = data.frame(
type=eba_results$spec1$bounds$type,
cdf.mu.normal=eba_results$spec1$bounds$cdf.mu.normal,
cdf.above.mu.normal=eba_results$spec1$bounds$cdf.above.mu.normal,
cdf.mu.generic=eba_results$spec1$bounds$cdf.mu.generic,
cdf.above.mu.generic=eba_results$spec1$bounds$cdf.above.mu.generic),
manual_spec2= data.frame(
type=eba_results$spec2$bounds$type,
cdf.mu.normal=eba_results$spec2$bounds$cdf.mu.normal,
cdf.above.mu.normal=eba_results$spec2$bounds$cdf.above.mu.normal,
cdf.mu.generic=eba_results$spec2$bounds$cdf.mu.generic,
cdf.above.mu.generic=eba_results$spec2$bounds$cdf.above.mu.generic))
But I have more than 2 specifications and I think this should be possible with lapply functions in a prettier way. Any help would be appreciated!
p.s.: A generic example to which hrbrmstr's answer applies but which turned out to be too simplistic:
exampleList = list(a=list(aa=data.frame(A=rnorm(10),B=rnorm(10)),bb=data.frame(A=rnorm(10),B=rnorm(10))),
b=list(aa=data.frame(A=rnorm(10),B=rnorm(10)),bb=data.frame(A=rnorm(10),B=rnorm(10))))
and I want to have an object which collects, for example, all the A and B vectors into two data frames (each with its respective A and B) which are then a list of data frames. Manually this would look like:
dfa <- data.frame(A=exampleList$a$aa$A,B=exampleList$a$aa$B)
dfb <- data.frame(A=exampleList$a$aa$A,B=exampleList$a$aa$B)
collectedResults <- list(a=dfa, b=dfb)
There's probably a less brute-force way to do this.
If you want lists of individual columns this is one way:
get_col <- function(my_list, col_name) {
unlist(lapply(my_list, function(x) {
lapply(x, function(y) { y[, col_name] })
}), recursive=FALSE)
}
get_col(exampleList, "A")
get_col(exampleList, "B")
If you want a consolidated data.frame of indicator columns this is one way:
collect_indicators <- function(my_list, indicators) {
lapply(my_list, function(x) {
do.call(rbind, c(lapply(x, function(y) { y[, indicators] }), make.row.names=FALSE))
})[[1]]
}
collect_indicators(exampleList, c("A", "B"))
If you just want to bring the individual data.frames up a level to make it easier to iterate over to write to a file:
unlist(exampleList, recursive=FALSE)
Much assumption about the true output format is being made (the question was a bit vague).
There is a brute force way which works but is dependent on several named objects:
collectEBA <- function(x){
df <- paste0("eba_results$",x,"$bounds")
df <- eval(parse(text=df))[,c("type",
"cdf.mu.normal","cdf.above.mu.normal",
"cdf.mu.generic","cdf.above.mu.generic")]
df[is.na(df)] <- "NA"
df
}
eba_export <- lapply(indicators,collectEBA)
names(eba_export) <- indicators