Extract and append data to new datasets in a for loop - r

I have (what I think) is a really simple question, but I can't figure out how to do it. I'm fairly new to lists, loops, etc.
I have a small dataset:
df <- c("one","two","three","four")
df <- as.data.frame(df)
df
I need to loop through this dataset and create a list of datasets, such that this is the outcome:
[[1]]
one
[[2]]
one
two
[[3]]
one
two
three
This is more or less as far as I've gotten:
blah <- list()
for(i in 1:3){
blah[[i]]<- i
}
The length will be variable when I use this in the future, so I need to automate it in a loop. Otherwise, I would just do
one <- df[1,]
two <- df[2,]
list(one, rbind(one, two))
Any ideas?

You can try using lapply :
result <- lapply(seq(nrow(df)), function(x) df[seq_len(x), , drop = FALSE])
result
#[[1]]
# df
#1 one
# [[2]]
# df
#1 one
#2 two
#[[3]]
# df
#1 one
#2 two
#3 three
#[[4]]
# df
#1 one
#2 two
#3 three
#4 four
seq(nrow(df)) creates a sequence from 1 to number of rows in your data (which is 4 in this case). function(x) part is called as anonymous function where each value from 1 to 4 is passed to one by one. seq_len(x) creates a sequence from 1 to x i.e 1 to 1 in first iteration, 1 to 2 in second and so on. We use this sequence to subset the rows from dataframe (df[seq_len(x), ]). Since the dataframe has only 1 column when we subset it , it changes it to a vector. To avoid that we add drop = FALSE.

Base R solution:
# Coerce df vector of data.frame to character, store as new data.frame: str_df => data.frame
str_df <- transform(df, df = as.character(df))
# Allocate some memory in order to split data into a list: df_list => empty list
df_list <- vector("list", nrow(str_df))
# Split the string version of the data.frame into a list as required:
# df_list => list of character vectors
df_list <- lapply(seq_len(nrow(str_df)), function(i){
str_df[if(i == 1){1}else{1:i}, grep("df", names(str_df))]
}
)
Data:
df <- c("one","two","three","four")
df <- as.data.frame(df)
df

Related

Subset dataframe using counter that resets to 1 and create dataframe for each subset

I have a dataframe that I need to break into multiple, smaller dataframes.
There is an integer index, which starts at 1 and counts up. When it resets to 1, I need to start creating a new dataframe.
df <- cbind(c(1,2,3,4,5,1,2,3,4), c("a","b","c","d","e","f","g","h","i"))
#end results should be:
df1 <- df[1:5, ]
df2 <- df[6:9, ]
How do I do this programmatically? I can find where all of the "1"s are, but how to I go row-wise and break it into different dataframes?
In your example, df is a character matrix, not a data.frame. To define a data.frame object use e.g. data.frame(index = c(1,2,3,4,5,1,2,3,4), value = c("a","b","c","d","e","f","g","h","i")
Find the index of the first value of each group, then split on groups. You do not need to perform any rowwise operation.
df <- data.frame(index = c(1,2,3,4,5,1,2,3,4), value = c("a","b","c","d","e","f","g","h","i"))
split(df, cumsum(df$index == 1))
result is a list of data.frame objects:
$`1`
index value
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
$`2`
index value
6 1 f
7 2 g
8 3 h
9 4 i
Try this approach with indexes and a loop. We create i1 to store the rows where there is 1. Then we compute the final position in i2. After that we create a list and use a loop to store the new data. Finally, we assign names and release to envir using list2env. Here the code:
#Create index
i1 <- which(df[,1]=='1')
i2 <- i1[-1]-1
#Test for dim
if(length(i2==1)){i2 <- c(i2,nrow(df))}
#Create a list
List <- list()
#Loop
for(j in 1:length(i1))
{
List[[j]] <- df[i1[j]:i2[j],]
}
#Assign names
names(List) <- paste0('df',1:length(List))
#Set to envir
list2env(List,envir = .GlobalEnv)

Nested for loop leading to: Error in [<-.data.frame`(`*tmp*` replacement has x rows, data has y

I have 6 data frames (dfs) with a lot of data of different biological groups and another 6 data frames (tax.dfs) with taxonomical information about those groups. I want to replace a column of each of the 6 dfs with a column with the scientific name of each species present in the 6 tax.dfs.
To do that I created two lists of the data frames and I'm trying to apply a nested for loop:
dfs <- list(df.birds, df.mammals, df.crocs, df.snakes, df.turtles, df.lizards)
tax.dfs <- list(tax.birds,tax.mammals, tax.crocs, tax.snakes, tax.turtles, tax.lizards )
for(i in dfs){
for(y in tax.dfs){
i[,1] <- y[,2]
}}
And this is the output I'm getting:
Error in `[<-.data.frame`(`*tmp*`, , 1, value = c("Aotus trivirgatus", :
replacement has 64 rows, data has 43
But both data frames have the same number of rows, I actually used dfs to create tax.dfs applying the tnrs_match_names function from rotl package.
Any suggestions of how I could fix this error or that help me to find another way to do what I need to will be greatly appreciated.
Thank You!
For what it is worth, to iterate over two objects simultaneously, the following works:
Example Data:
df1 <- data.frame(a=1, b=2)
df2 <- data.frame(c=3, d=4)
df3 <- data.frame(e=5, f=6)
df_1 <- data.frame(a='A', b='B')
df_2 <- data.frame(c='C', d='D')
df_3 <- data.frame(e='E', f='F')
dfs <- list(df1, df2, df3)
df_s <- list(df_1, df_2, df_3)
Using mapply:
out <- mapply(function(one, two) {
one[,1] <- two[,2]
return(one)
}, dfs, df_s, SIMPLIFY = F )
out
[[1]]
a b
1 B 2
[[2]]
c d
1 D 4
[[3]]
e f
1 F 6
Here, one and two in mapply correspond to the different elements in dfs and df_s. Having said that, let's make it a bit more interesting. Let's change my third example to the following:
df_3 <- data.frame(e=c('E', 'e'), f=c('F', 'f'))
df_s <- list(df_1, df_2, df_3) # needs to be executed again
Now, let's adjust the function:
out <- mapply(function(one, two) {
if(nrow(one) != nrow(two)){return('Wrong dimensions')}
one[,1] <- two[,2]
return(one)
}, dfs, df_s, SIMPLIFY = F )
out
[[1]]
a b
1 B 2
[[2]]
c d
1 D 4
[[3]]
[1] "Wrong dimensions"

How do you add a column to data frames in a list?

I have a list of data frames. I want to add a new column to each data frame. For example, I have three data frames as follows:
a = data.frame("Name" = c("John","Dor"))
b = data.frame("Name" = c("John2","Dor2"))
c = data.frame("Name" = c("John3","Dor3"))
I then put them into a list:
dfs = list(a,b,c)
I then want to add a new column with a unique value to each data frame, e.g.:
dfs[1]$new_column <- 5
But I get the following error:
"number of items to replace is not a multiple of replacement length"
I have also tried using two brackets:
dfs[[1]]$new_column <- 5
This does not return an error but it does not add the column.
This would be in a 'for' loop and a different value would be added to each data frame.
Any help would be much appreciated. Thanks in advance!
Let's say you want to add a new column with value 5:7 for each dataframe. We can use Map
new_value <- 5:7
Map(cbind, dfs, new_column = new_value)
#[[1]]
# Name new_column
#1 John 5
#2 Dor 5
#[[2]]
# Name new_column
#1 John2 6
#2 Dor2 6
#[[3]]
# Name new_column
#1 John3 7
#2 Dor3 7
With lapply you could do
lapply(seq_along(dfs), function(i) cbind(dfs[[i]], new_column = new_value[i]))
Or as #camille mentioned it works if you use [[ for indexing in the for loop
for (i in seq_along(dfs)) {
dfs[[i]]$new_column <- new_value[[i]]
}
The equivalent purrr version of this would be
library(purrr)
map2(dfs, new_value, cbind)
and
map(seq_along(dfs), ~cbind(dfs[[.]], new_colum = new_value[.]))

dplyr::bind_rows(...) vs. do.call(rbind, ...) for lists of symbols

Suppose I have the below data frames and character vector of names:
x <- data.frame(val = 1)
y <- data.frame(val = 2)
nms <- c("x", "y")
I want to simply row bind the data frames together. I can do this with do.call and rbind without issue:
library(dplyr)
do.call(rbind, syms(nms))
# val
#1 1
#2 2
However if I try dplyr::bind_rows I get a strange error telling me that argument 1 must be a data frame event though it is a data frame:
bind_rows(syms(nms))
#Error: Argument 1 must be a data frame or a named atomic vector, not a data.frame
Would appreciate if someone could tell why this occurs.
We can use mget to return the datasets in a list and then do the bind_rows
library(dplyr)
mget(nms) %>%
bind_rows
# val
#1 1
#2 2

R: Merging lists of data frames

I'm a total noob at R and I've tried (and retried) to search for an answer to the following problem, but I've not been able to get any of the proposed solutions to do what I'm interested in.
I have two lists of named elements, with each element pointing to data frames with identical layouts:
(EDIT)
df1 <- data.frame(A=c(1,2,3),B=c("A","B","C"))
df2 <- data.frame(A=c(98,99),B=c("Y","Z"))
lst1 <- c(X=df1,Y=df2)
df3 <- data.frame(A=c(4,5),B=c("D","E"))
lst2 <- c(X=df3)
(EDIT 2)
So it seems like storing multiple data frames in a list is a bad idea, as it will convert the data frames to lists. So I'll go out looking for an alternative way to store a set of named data frames.
In general the names of the elements in the two elements might overlap partially, completely, or not at all.
I'm looking for a way to merge the two lists into a single list:
<some-function-sequence>(lst1, lst2)
->
c(X=rbind(df1,df3),Y=df2)
-resulting in something like this:
[EDIT: Syntax changed to correctly reflect desired result (list-of-data frames)]
$X
A B
1 1 A
2 2 B
3 3 C
4 4 D
5 5 E
$X.B
A B
1 98 Y
2 99 Z
I.e:
IF the lists contain identical element names, each pointing to a data frame, THEN I want to 'rbind' the rows from these two data frames and assign the resulting data frame to the same element name in the resulting list.
Otherwise the element names and data frames from both lists should just be copied into the resulting list.
I've tried the solutions from a number of discussions such as:
Can I combine a list of similar dataframes into a single dataframe?
Combine/merge lists by elements names
Simultaneously merge multiple data.frames in a list
Combine/merge lists by elements names (list in list)
Convert a list of data frames into one data frame
-but I've not been able to find the right solution. A general problem seems to be that the data frame ends up being converted into a list by the application of 'mapply/sapply/merge/...' - and usually also sliced and/or merged in ways which I am not interested in. :)
Any help with this will be much appreciated!
[SOLUTION]
The solution seems to be to change the use of c(...) when collecting data frames to list(...) after which the solution proposed by Pierre seems to give the desired result.
Here is a proposed solution using split and c to combine like terms. Please read the caveat at the bottom:
s <- split(c(lst1, lst2), names(c(lst1,lst2)))
lapply(s, function(lst) do.call(function(...) unname(c(...)), lst))
# $X.A
# [1] 1 2 3 4 5
#
# $X.B
# [1] "A" "B" "C" "D" "E"
#
# $Y.A
# [1] 98 99
#
# $Y.B
# [1] "Y" "Z"
This solution is based on NOT having factors as strings. It will not throw an error but the factors will be converted to numbers. Below I show how I transformed the data to remove factors. Let me know if you require factors:
df1 <- data.frame(A=c(1,2,3),B=c("A","B","C"), stringsAsFactors=FALSE)
df2 <- data.frame(A=c(98,99),B=c("Y","Z"), stringsAsFactors=FALSE)
lst1 <- c(X=df1,Y=df2)
df3 <- data.frame(A=c(4,5),B=c("D","E"), stringsAsFactors=FALSE)
lst2 <- c(X=df3)
If the data is stored in lists we can use:
lapply(split(c(lst1, lst2), names(c(lst1,lst2))), function(lst) do.call(rbind, lst))
The following solution is probably not the most efficient way. However, if I got your problem right this should work ;)
# Example data
# Some vectors
a <- 1:5
b <- 3:7
c <- rep(5, 5)
d <- 5:1
# Some dataframes, data1 and data3 have identical column names
data1 <- data.frame(a, b)
data2 <- data.frame(c, b)
data3 <- data.frame(a, b)
data4 <- data.frame(c, d)
# 2 lists
list1 <- list(data1, data2)
list2 <- list(data3, data4)
# Loop, wich checks for the dataframe names and rbinds dataframes with the same column names
final_list <- list1
used_lists <- numeric()
for(i in 1:length(list1)) {
for(j in 1:length(list2)) {
if(sum(colnames(list1[[i]]) == colnames(list2[[j]])) == ncol(list1[[i]])) {
final_list[[i]] <- rbind(list1[[i]], list2[[j]])
used_lists <- c(used_lists, j)
}
}
}
# Adding the other dataframes, which did not have the same column names
for(i in 1:length(list2)) {
if((i %in% used_lists) == FALSE) {
final_list[[length(final_list) + 1]] <- list2[[i]]
}
}
# Final list, which includes all other lists
final_list

Resources