How to add/merge dataframes in a list - r

I used the following for loop to read 7 csv files and add them to a list.
list <- list()
l <- 1
for(i in 1:7){
data <- read.csv(paste("file",i,".csv",sep=""),header=FALSE)
list[[l]] <- data
l <- l + 1
}
So now I have a list named "list" containing 7 dataframes, right?
Each of the 8 dataframes contains the same three columns (NAME, SURNAME, AGE).
I now want to add:
df <- data.frame(NAME,SURNAME,AGE) ## to each dataframe in the list.
Did that help at all? My question is, how can I achieve that for all 7 objects in the list automatically?

If 'lst' holds seven data.frames and you want to 'rbind' an 8th dataset ('d1') to each of the datasets in the list, you can use Map
Map(rbind, lst, list(d1))
Or using lapply
lapply(lst, rbind, d1)
Update
If 'lst' is of length 8 and you want to rbind each of the first 7 elements with the dataset in the 8th element, then you can just do
Map(rbind, lst[-8], lst[8])
data
set.seed(24)
lst <- lapply(1:7, function(i) as.data.frame(matrix(sample(0:10, 3*10,
replace=TRUE), ncol=3)))
set.seed(49)
d1 <- as.data.frame(matrix(sample(1:20, 3*10, replace=TRUE), ncol=3))
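As a quick check of the Map and lapply approaches above, the following sketch rebuilds the sample data and confirms that every element of the result gains the rows of 'd1'. The names res_map and res_lapply are introduced here only for illustration.

```r
# Rebuild the sample data from the answer
set.seed(24)
lst <- lapply(1:7, function(i) {
  as.data.frame(matrix(sample(0:10, 3 * 10, replace = TRUE), ncol = 3))
})
set.seed(49)
d1 <- as.data.frame(matrix(sample(1:20, 3 * 10, replace = TRUE), ncol = 3))

# list(d1) has length 1, so Map recycles it against all 7 elements of lst
res_map <- Map(rbind, lst, list(d1))
# lapply passes d1 as the second argument to rbind for every element
res_lapply <- lapply(lst, rbind, d1)

length(res_map)        # 7
sapply(res_map, nrow)  # 20 for each element (10 original rows + 10 from d1)
```

Both calls should give the same seven data frames; which one to use is mostly a matter of taste.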

Or, if the ultimate goal is to just ensure all 8 CSV files make it into one data.frame:
# generate some sample files
files <- sprintf("iris%d.csv", 1:8)
for (f in files) { write.csv(iris, f, row.names=FALSE) }
# make one happy data frame
do.call(rbind, lapply(files, read.csv))
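If it matters which file each row came from, one possible variation is to tag the rows while reading. This is only a sketch: the helper read_one and the column name source are hypothetical, and the files are written to a temporary directory so nothing in the working directory is touched.

```r
# Write a few sample files into a temporary directory
tmp <- tempfile("iris_demo_")
dir.create(tmp)
files <- file.path(tmp, sprintf("iris%d.csv", 1:3))
for (f in files) write.csv(iris, f, row.names = FALSE)

# Hypothetical helper: read one file and record its name in a new column
read_one <- function(f) {
  d <- read.csv(f)
  d$source <- basename(f)
  d
}

# Read every file and stack the results into one data frame
combined <- do.call(rbind, lapply(files, read_one))
dim(combined)  # 450 rows (3 x 150), 6 columns (5 from iris + source)
```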

Related

combine multiple dataframes based on sequence of names

Say I have 30 dataframes all named with a date from 01/01/2000 to 30/01/2000 in the format of ddmmyy (code below) :
Season <- seq(as.Date("2000-01-01"),as.Date("2000-01-30"),1)
Season <- format(Season,"%d%m%y")
for (s in Season) {
df <- data.frame(X=1:10, Y=1:10)
aa <- paste(s,"tests",s ,sep = "_")
assign(aa,df)
}
Each name, as you can see, has the word tests added to it. I want to combine (rbind?) the data.frames based on the date. In this case, combine data.frames that contain the dates from 01-01-00 to 10-01-00.
I have the below code to combine all dataframes but what if I only want to select the ones shown above?
All_dfs <- do.call(rbind, eapply(.GlobalEnv,function(x) if(is.data.frame(x)) x))
Is it better to create a list first?
We can use mget to fetch the objects by name into a list and then rbind that list of data.frames. Each object name is the 'Season' value, followed by "_tests_", followed by the same 'Season' value again, so we can rebuild the names with paste0 and pass them to mget.
res <- do.call(rbind, mget( paste0(Season[1:10], "_tests_", Season[1:10])))
dim(res)
#[1] 100 2
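On the question of whether it is better to create a list first: a sketch of that alternative is below. Instead of calling assign for each date, the data frames are kept in one named list (called dfs here, a name chosen only for this example) and the wanted ones are selected by name.

```r
Season <- seq(as.Date("2000-01-01"), as.Date("2000-01-30"), by = 1)
Season <- format(Season, "%d%m%y")

# Build a named list of data frames instead of 30 separate global objects
df_names <- paste(Season, "tests", Season, sep = "_")
dfs <- setNames(
  lapply(Season, function(s) data.frame(X = 1:10, Y = 1:10)),
  df_names
)

# Select the first ten dates by name and stack them
wanted <- paste(Season[1:10], "tests", Season[1:10], sep = "_")
res <- do.call(rbind, dfs[wanted])
dim(res)  # 100 2
```

Keeping the data frames in a list avoids scanning the global environment and makes the selection an ordinary subsetting step.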

R: Merging lists of data frames

I'm a total noob at R and I've tried (and retried) to search for an answer to the following problem, but I've not been able to get any of the proposed solutions to do what I'm interested in.
I have two lists of named elements, with each element pointing to data frames with identical layouts:
(EDIT)
df1 <- data.frame(A=c(1,2,3),B=c("A","B","C"))
df2 <- data.frame(A=c(98,99),B=c("Y","Z"))
lst1 <- c(X=df1,Y=df2)
df3 <- data.frame(A=c(4,5),B=c("D","E"))
lst2 <- c(X=df3)
(EDIT 2)
So it seems like storing multiple data frames in a list is a bad idea, as it will convert the data frames to lists. So I'll go out looking for an alternative way to store a set of named data frames.
In general the names of the elements in the two lists might overlap partially, completely, or not at all.
I'm looking for a way to merge the two lists into a single list:
<some-function-sequence>(lst1, lst2)
->
c(X=rbind(df1,df3),Y=df2)
-resulting in something like this:
[EDIT: Syntax changed to correctly reflect desired result (list-of-data frames)]
$X
A B
1 1 A
2 2 B
3 3 C
4 4 D
5 5 E
$Y
A B
1 98 Y
2 99 Z
I.e:
IF the lists contain identical element names, each pointing to a data frame, THEN I want to 'rbind' the rows from these two data frames and assign the resulting data frame to the same element name in the resulting list.
Otherwise the element names and data frames from both lists should just be copied into the resulting list.
I've tried the solutions from a number of discussions such as:
Can I combine a list of similar dataframes into a single dataframe?
Combine/merge lists by elements names
Simultaneously merge multiple data.frames in a list
Combine/merge lists by elements names (list in list)
Convert a list of data frames into one data frame
-but I've not been able to find the right solution. A general problem seems to be that the data frame ends up being converted into a list by the application of 'mapply/sapply/merge/...' - and usually also sliced and/or merged in ways which I am not interested in. :)
Any help with this will be much appreciated!
[SOLUTION]
The solution seems to be to change the use of c(...) when collecting data frames to list(...) after which the solution proposed by Pierre seems to give the desired result.
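The difference between c(...) and list(...) described above can be seen in a small sketch. The names flattened and kept are used here only for illustration.

```r
df1 <- data.frame(A = c(1, 2, 3), B = c("A", "B", "C"))
df2 <- data.frame(A = c(98, 99), B = c("Y", "Z"))

# c() splices the columns of each data frame into one flat list
flattened <- c(X = df1, Y = df2)
# list() keeps each data frame intact as a single element
kept <- list(X = df1, Y = df2)

names(flattened)       # "X.A" "X.B" "Y.A" "Y.B"
names(kept)            # "X" "Y"
is.data.frame(kept$X)  # TRUE
```

So a list is a perfectly good container for data frames, as long as it is built with list(...) rather than c(...).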
Here is a proposed solution using split and c to combine like terms. Please read the caveat at the bottom:
s <- split(c(lst1, lst2), names(c(lst1,lst2)))
lapply(s, function(lst) do.call(function(...) unname(c(...)), lst))
# $X.A
# [1] 1 2 3 4 5
#
# $X.B
# [1] "A" "B" "C" "D" "E"
#
# $Y.A
# [1] 98 99
#
# $Y.B
# [1] "Y" "Z"
This solution assumes the strings are NOT stored as factors. With factors it will not throw an error, but the factors will be converted to numbers. Below I show how I transformed the data to remove factors. Let me know if you require factors:
df1 <- data.frame(A=c(1,2,3),B=c("A","B","C"), stringsAsFactors=FALSE)
df2 <- data.frame(A=c(98,99),B=c("Y","Z"), stringsAsFactors=FALSE)
lst1 <- c(X=df1,Y=df2)
df3 <- data.frame(A=c(4,5),B=c("D","E"), stringsAsFactors=FALSE)
lst2 <- c(X=df3)
If the data frames are stored in lists built with list(...) rather than c(...), we can use:
lapply(split(c(lst1, lst2), names(c(lst1,lst2))), function(lst) do.call(rbind, lst))
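A self-contained sketch of that last line, assuming the lists are built with list(...), is shown below. The names both and merged are introduced for this example, and unname is added only so the row names of the result stay plain.

```r
df1 <- data.frame(A = c(1, 2, 3), B = c("A", "B", "C"), stringsAsFactors = FALSE)
df2 <- data.frame(A = c(98, 99), B = c("Y", "Z"), stringsAsFactors = FALSE)
df3 <- data.frame(A = c(4, 5), B = c("D", "E"), stringsAsFactors = FALSE)

lst1 <- list(X = df1, Y = df2)
lst2 <- list(X = df3)

# Concatenate the two lists, group elements by name, rbind each group
both <- c(lst1, lst2)
merged <- lapply(split(both, names(both)), function(l) do.call(rbind, unname(l)))

names(merged)   # "X" "Y"
merged$X$A      # 1 2 3 4 5
merged$Y$B      # "Y" "Z"
```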
The following solution is probably not the most efficient way. However, if I got your problem right this should work ;)
# Example data
# Some vectors
a <- 1:5
b <- 3:7
c <- rep(5, 5)
d <- 5:1
# Some dataframes, data1 and data3 have identical column names
data1 <- data.frame(a, b)
data2 <- data.frame(c, b)
data3 <- data.frame(a, b)
data4 <- data.frame(c, d)
# 2 lists
list1 <- list(data1, data2)
list2 <- list(data3, data4)
# Loop, which checks the dataframes' column names and rbinds dataframes with the same column names
final_list <- list1
used_lists <- numeric()
for(i in 1:length(list1)) {
for(j in 1:length(list2)) {
if(sum(colnames(list1[[i]]) == colnames(list2[[j]])) == ncol(list1[[i]])) {
final_list[[i]] <- rbind(list1[[i]], list2[[j]])
used_lists <- c(used_lists, j)
}
}
}
# Adding the other dataframes, which did not have the same column names
for(i in 1:length(list2)) {
if((i %in% used_lists) == FALSE) {
final_list[[length(final_list) + 1]] <- list2[[i]]
}
}
# Final list, which includes all other lists
final_list
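A more compact variant of the same idea is sketched below. It is not the loop above rewritten line for line: it groups all data frames by a key made of their column names (the separator "|" is an arbitrary choice) and rbinds each group, so the result is ordered by key rather than by position.

```r
a <- 1:5
b <- 3:7
c <- rep(5, 5)
d <- 5:1

data1 <- data.frame(a, b)
data2 <- data.frame(c, b)
data3 <- data.frame(a, b)
data4 <- data.frame(c, d)

list1 <- list(data1, data2)
list2 <- list(data3, data4)

# Build one key per data frame from its column names
all_dfs <- c(list1, list2)
key <- vapply(all_dfs, function(x) paste(colnames(x), collapse = "|"), character(1))

# Group data frames with identical column names and stack each group
final_list <- lapply(split(all_dfs, key), function(l) do.call(rbind, l))

names(final_list)            # "a|b" "c|b" "c|d"
nrow(final_list[["a|b"]])    # 10
```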

Apply function between lists of data frames

I have 2 lists (my.listA and my.listB) in R including 3 data frames each:
da1 <- data.frame(x=c(1,2,3),y=c(4,5,6))
da2 <- data.frame(x=c(3,2,1),y=c(6,5,4))
da3 <- data.frame(x=c(5,4,1),y=c(8,5,7))
my.listA <- list(da1, da2, da3)
db1 <- data.frame(z=c(2))
db2 <- data.frame(z=c(3))
db3 <- data.frame(z=c(4))
my.listB <- list(db1, db2, db3)
I am trying to obtain a new list (my.listAB) so that it includes 3 data frames showing the element by element product of the data frames in my.listA and my.listB paired according to the number at the end of the data frames' names, that is, the product of elements in da1 by elements in db1, the product of da2 by db2 and the product of da3 by db3.
This would be my desired result:
dab1 <- data.frame(x=c(2,4,6),y=c(8,10,12))
dab2 <- data.frame(x=c(9,6,3),y=c(18,15,12))
dab3 <- data.frame(x=c(20,16,4),y=c(32,20,28))
my.listAB <- list(dab1 , dab2 , dab3)
I tried the following, but it did not work:
for (i in 1:3) {
my.listAB <- my.listA[[i]]*my.listB[[i]]
};
Ideally someone could guide me towards a solution using the lapply function?
Many thanks!
You can use
l <- lapply(1:3, function(x) my.listA[[x]] * my.listB[[x]]$z)
or
l <- list()
for (x in 1:3)
l[[x]] <- my.listA[[x]] * my.listB[[x]]$z
In addition to the lapply and for loop option suggested by @lukeA in the comments, you could also try Map
r1 <- Map(`*`, my.listA,unlist(my.listB))
identical(r1, my.listAB)
#[1] TRUE
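Another way to write the same pairing, sketched here, is to let Map walk both lists in parallel and pull the z column out inside the function. This avoids relying on unlist flattening my.listB into a plain vector.

```r
da1 <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
da2 <- data.frame(x = c(3, 2, 1), y = c(6, 5, 4))
da3 <- data.frame(x = c(5, 4, 1), y = c(8, 5, 7))
my.listA <- list(da1, da2, da3)

db1 <- data.frame(z = c(2))
db2 <- data.frame(z = c(3))
db3 <- data.frame(z = c(4))
my.listB <- list(db1, db2, db3)

# Multiply the i-th data frame in my.listA by the scalar z of the i-th in my.listB
my.listAB <- Map(function(a, b) a * b$z, my.listA, my.listB)

my.listAB[[2]]$x  # 9 6 3
my.listAB[[3]]$y  # 32 20 28
```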

rbinding a list of lists of dataframes based on nested order

I have a dataframe, df and a function process that returns a list of two dataframes, a and b. I use dlply to split up the df on an id column, and then return a list of lists of dataframes. Here's sample data/code that approximates the actual data and methods:
df <- data.frame(id1=rep(c(1,2,3,4), each=2))
process <- function(df) {
a <- data.frame(d1=rnorm(1), d2=rnorm(1))
b <- data.frame(id1=df$id1, a=rnorm(nrow(df)), b=runif(nrow(df)))
list(a=a, b=b)
}
require(plyr)
output <- dlply(df, .(id1), process)
output is a list of lists of dataframes; each nested list always has two dataframes, named a and b. In this case the outer list has length 4.
What I am looking to generate is a dataframe with all the a dataframes, along with an id column indicating their respective value (I believe this is left in the list as the split_labels attribute, see str(output)). Then similarly for the b dataframes.
So far I have in part used this question to come up with this code:
list <- unlist(output, recursive = FALSE)
list.a <- lapply(1:4, function(x) {
list[[(2*x)-1]]
})
all.a <- rbind.fill(list.a)
Which gives me the final a dataframe (and likewise for b with a different subscript into list), however it doesn't have the id column I need and I'm pretty sure there's got to be a more straightforward or elegant solution. Ideally something clean using plyr.
Not very clean but you can try something like this (assuming the same data generation process).
list.aID <- lapply(1:4, function(x) {
cbind(list[[(2*x) - 1]], list[[2*x]][1, 1, drop = FALSE])
})
all.aID <- rbind.fill(list.aID)
all.aID
d1 d2 id1
1 0.68103 -0.74023 1
2 -0.50684 1.23713 2
3 0.33795 -0.37277 3
4 0.37827 0.56892 4
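A sketch that does not depend on the position of a and b inside the flattened list is given below. To keep it free of package dependencies it uses base split and lapply as a stand-in for dlply, which is an assumption of this example; the id is taken from the names of the outer list.

```r
set.seed(1)
df <- data.frame(id1 = rep(c(1, 2, 3, 4), each = 2))
process <- function(df) {
  a <- data.frame(d1 = rnorm(1), d2 = rnorm(1))
  b <- data.frame(id1 = df$id1, a = rnorm(nrow(df)), b = runif(nrow(df)))
  list(a = a, b = b)
}

# Base-R stand-in for dlply(df, .(id1), process): names(output) are the ids
output <- lapply(split(df, df$id1), process)

# Stack every "a" data frame, adding the id taken from the list names
all.a <- do.call(rbind, lapply(names(output), function(id) {
  cbind(id1 = as.numeric(id), output[[id]]$a)
}))

# The "b" data frames already carry id1, so they can be stacked directly
all.b <- do.call(rbind, lapply(output, `[[`, "b"))
rownames(all.b) <- NULL

dim(all.a)  # 4 3
dim(all.b)  # 8 3
```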

Vectorizing finding the row-wise mean of data frames within lists within lists

I have a list of sublists. Each sublist contains an identical data frame (identical except for the data inside it) and a 'yes/no' label. I'd like to find the row-wise mean of the data frames, if the yes/no label is TRUE.
#Create the data frames
id <- c("a", "b", "c")
df1 <- data.frame(id=id, data=c(1, 2, 3))
df2 <- df1
df3 <- data.frame(id=id, data=c(1000, 2000, 3000))
#Create the sublists that will store the data frame and the yes/no variable
sub1 <- list(data=df1, useMe=TRUE)
sub2 <- list(data=df2, useMe=TRUE)
sub3 <- list(data=df3, useMe=FALSE)
#Store the sublists in a main list
main <- list(sub1, sub2, sub3)
I want a vectorized function that will return the row-wise average of the data frames, but only if $useMe==TRUE, like so:
> desiredFun(main)
id data
1 a 1
2 b 2
3 c 3
Here's a fairly general way to approach this problem:
# Extract the "data" portion of each "main" list element
# (using lapply to return a list)
AllData <- lapply(main, "[[", "data")
# Extract the "useMe" portion of each "main" list element
# (using sapply to return a vector)
UseMe <- sapply(main, "[[", "useMe")
# Select the "data" list elements where the "useMe" vector elements are TRUE
# and rbind all the data.frames together
Data <- do.call(rbind, AllData[UseMe])
library(plyr)
# Aggregate the resulting data.frame
Avg <- ddply(Data, "id", summarize, data=mean(data))
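The same steps can be sketched without plyr, using Filter and aggregate from base R. The names keep, stacked and avg are introduced for this example only.

```r
id <- c("a", "b", "c")
df1 <- data.frame(id = id, data = c(1, 2, 3))
df2 <- df1
df3 <- data.frame(id = id, data = c(1000, 2000, 3000))

main <- list(
  list(data = df1, useMe = TRUE),
  list(data = df2, useMe = TRUE),
  list(data = df3, useMe = FALSE)
)

# Keep only the sublists flagged with useMe == TRUE
keep <- Filter(function(s) isTRUE(s$useMe), main)

# Stack their data frames and average the data column per id
stacked <- do.call(rbind, lapply(keep, `[[`, "data"))
avg <- aggregate(data ~ id, data = stacked, FUN = mean)

avg$data  # 1 2 3
```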

Resources