Binding lists of data frames while preserving index - r

There are many posts on converting a list of data frames into a single data frame. However, I need to do that while recording which list element each resulting row came from, i.e. I need to preserve an index of the original list. If the list is unnamed, I just need the index number; if the list is named, I need to keep the name of the original list element. How can I do that?
With the data like below:
foo <- list(data.frame(x=c('a', 'b', 'c'),y = c(1,2,3)),
data.frame(x=c('d', 'e', 'f'),y = c(4,5,6)))
I need an output like this:
index x y
1 1 a 1
2 1 b 2
3 1 c 3
4 2 d 4
5 2 e 5
6 2 f 6
while with named elements of a list:
foo <- list(df1 = data.frame(x=c('a', 'b', 'c'),y = c(1,2,3)),
df2 = data.frame(x=c('d', 'e', 'f'),y = c(4,5,6)))
the output would be:
index x y
1 df1 a 1
2 df1 b 2
3 df1 c 3
4 df2 d 4
5 df2 e 5
6 df2 f 6

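We can use imap, which passes each element as .x and its name (or its position, if the list is unnamed) as .y
library(purrr)
library(dplyr) # needed for mutate() and %>%
imap_dfr(foo, ~ .x %>%
  mutate(index = .y))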
Or with map
map_dfr(foo, .f = cbind, .id = 'index')
# index x y
#1 1 a 1
#2 1 b 2
#3 1 c 3
#4 2 d 4
#5 2 e 5
#6 2 f 6
Or use Map from base R: loop over the elements of 'foo' together with their positions (seq_along(foo)), cbind each position on as a new 'index' column, and then rbind the pieces. Note that this always records the position, even if the list is named.
do.call(rbind, Map(cbind, index = seq_along(foo), foo))
# index x y
#1 1 a 1
#2 1 b 2
#3 1 c 3
#4 2 d 4
#5 2 e 5
#6 2 f 6

You could name the list by position if it is unnamed, then loop over the names with lapply, cbind each name on as an 'index' column, and rbind the resulting data frames into one.
names(foo) <- if (is.null(names(foo))) seq_along(foo) else names(foo)
do.call("rbind", lapply(names(foo), function(x) cbind(index = x, foo[[x]])))
# index x y
#1 1 a 1
#2 1 b 2
#3 1 c 3
#4 2 d 4
#5 2 e 5
#6 2 f 6
For a named list, the same call keeps the names:
do.call("rbind", lapply(names(foo), function(x) cbind(index = x, foo[[x]])))
# index x y
#1 df1 a 1
#2 df1 b 2
#3 df1 c 3
#4 df2 d 4
#5 df2 e 5
#6 df2 f 6
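The named and unnamed cases can also be folded into one helper. The following is a minimal base R sketch, not taken from any of the answers above: the function name bind_with_index and its id argument are made up for illustration, and it assumes the list is either fully named or not named at all.

```r
# Hypothetical helper: bind a list of data frames and record where each row came from
bind_with_index <- function(lst, id = "index") {
  # Use the element names when present, otherwise the positions
  keys <- if (is.null(names(lst))) seq_along(lst) else names(lst)
  pieces <- Map(function(d, k) {
    # Put the key in front as a column named by `id`
    cbind(setNames(data.frame(rep(k, nrow(d))), id), d)
  }, lst, keys)
  out <- do.call(rbind, unname(pieces))
  rownames(out) <- NULL
  out
}

foo <- list(df1 = data.frame(x = c("a", "b", "c"), y = c(1, 2, 3)),
            df2 = data.frame(x = c("d", "e", "f"), y = c(4, 5, 6)))
bind_with_index(foo)          # index holds df1 / df2
bind_with_index(unname(foo))  # index holds 1 / 2
```

Unlike the names(foo) <- ... approach, this leaves the input list untouched and keeps the index as an integer when the list is unnamed.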

Related

Adding an index column representing a repetition of a dataframe in R

I have a dataframe in R that I'd like to repeat several times, and I want to add in a new variable to index those repetitions. The best I've come up with is using mutate + rbind over and over, and I feel like there has to be an efficient dataframe method I could be using here.
Here's an example: df <- data.frame(x = 1:3, y = letters[1:3]) gives us the dataframe
x y
1 a
2 b
3 c
I'd like to repeat that say 3 times, with an index that looks like this:
x y index
1 a 1
2 b 1
3 c 1
1 a 2
2 b 2
3 c 2
1 a 3
2 b 3
3 c 3
Using the rep function, I can get the first two columns, but not the index column. The best I've come up with so far (using dplyr) is:
df2 <-
df %>%
mutate(index = 1) %>%
rbind(df %>% mutate(index = 2)) %>%
rbind(df %>% mutate(index = 3))
This obviously doesn't work if I need to repeat my dataframe more than a handful of times. It feels like the kind of thing that should be easy to do using dataframe methods, but I haven't been able to find anything.
Grateful for any tips!
This works for any number of copies; you only have to set n, the first argument of replicate.
replicate takes two main arguments: n, the number of copies to make, and expr, which here is the data set itself. With simplify = FALSE the result is a list whose elements are copies of the data set.
We then pass that list to imap_dfr from the purrr package to give each copy its own id. .x is each element of the list (here a data frame) and .y is the position of that element, so the first copy gets id 1, the second gets id 2, and so on. The pieces are then row-bound into one data frame.
library(dplyr)
library(purrr)
replicate(3, df, simplify = FALSE) %>%
imap_dfr(~ .x %>%
mutate(id = .y))
x y id
1 1 a 1
2 2 b 1
3 3 c 1
4 1 a 2
5 2 b 2
6 3 c 2
7 1 a 3
8 2 b 3
9 3 c 3
In base R you can use the following code:
do.call(rbind,
mapply(function(x, z) {
x$id <- z
x
}, replicate(3, df, simplify = FALSE), 1:3, SIMPLIFY = FALSE))
x y id
1 1 a 1
2 2 b 1
3 3 c 1
4 1 a 2
5 2 b 2
6 3 c 2
7 1 a 3
8 2 b 3
9 3 c 3
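You can use rerun to repeat the data frame n times and let bind_rows add the index column through its .id argument. (rerun is deprecated as of purrr 1.0.0; replicate(n, df, simplify = FALSE) builds the same list in base R.)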
library(dplyr)
library(purrr)
n <- 3
df <- data.frame(x = 1:3, y = letters[1:3])
bind_rows(rerun(n, df), .id = 'index')
# index x y
#1 1 1 a
#2 1 2 b
#3 1 3 c
#4 2 1 a
#5 2 2 b
#6 2 3 c
#7 3 1 a
#8 3 2 b
#9 3 3 c
In base R, we can repeat the row indices n times and build the index column with rep(..., each = nrow(df)).
transform(df[rep(1:nrow(df), n), ], index = rep(1:n, each = nrow(df)))
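To make the two rep calls above a bit more concrete, here is a small sketch of the same idea split into steps. The names row_idx and rep_idx are only for illustration, and the row names are reset at the end because repeated row indexing would otherwise leave names like 1.1 and 2.1.

```r
df <- data.frame(x = 1:3, y = letters[1:3])
n <- 3

# Row positions repeated as a block: 1 2 3 1 2 3 1 2 3
row_idx <- rep(seq_len(nrow(df)), times = n)

# Repetition number, held constant within each block: 1 1 1 2 2 2 3 3 3
rep_idx <- rep(seq_len(n), each = nrow(df))

# Pick the repeated rows and attach the repetition number
out <- cbind(df[row_idx, ], index = rep_idx)
rownames(out) <- NULL  # drop the 1.1, 2.1, ... row names
out
```

The times/each distinction is the whole trick: times repeats the vector as a block, each repeats every element in place.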
One more way
n <- 3
map_dfr(seq_len(n), ~ df %>% mutate(index = .x))
x y index
1 1 a 1
2 2 b 1
3 3 c 1
4 1 a 2
5 2 b 2
6 3 c 2
7 1 a 3
8 2 b 3
9 3 c 3

I want to amend several columns in a consecutive way based on serial ID without rbind as columns headings are not identical

Instead of copying and pasting the corresponding columns in Excel, I want to fill in several columns step by step, matching on a serial ID named addr.
Assume my data sets are like these
df1 <- data.frame(addr=c('a','b','c','d'),
num = c(1,2,3,4),
x=c(1, NA,4,5));df1
df2 <- data.frame(addr=c('e','f','g'),
num=c(100,200,500));df2
var<-intersect(names(df1), names(df2));var
combined.df<-merge(x = df1, y = df2, by = var, all=T);combined.df
df3 <- data.frame(addr=c('e','f','g'),
x=c(5,7,NA));df3
var<-intersect(names(df3), names(combined.df));var
combined.df<-merge(x = combined.df, y = df3, by = var, all=T);combined.df
The current output is
addr x num
1 a 1 1
2 b NA 2
3 c 4 3
4 d 5 4
5 e 5 NA
6 e NA 100
7 f 7 NA
8 f NA 200
9 g NA 500
The desired output is
addr x num
1 a 1 1
2 b NA 2
3 c 4 3
4 d 5 4
5 e 5 100
6 f 7 200
7 g NA 500
That is, fill in the empty cells without overwriting cells that already hold a value.
Any advice will be greatly appreciated
To automate this with a for loop, put every data set except the first in a list and start from a copy of the first data set, 'out'. Then loop over the list, merging 'out' with each element in turn, using the intersect of the two sets of column names as by, and assign (<-) the result back to 'out'.
out <- df1
lst1 <- list(df2, df3)
for(i in seq_along(lst1)) {
out <- merge(out, lst1[[i]],
by = intersect(names(out), names(lst1[[i]])), all = TRUE)
}
Then we group by 'addr' and summarise across all other columns, keeping the non-NA value where one exists and returning NA otherwise. This assumes each 'addr' has at most one non-NA value per column.
library(dplyr)
out %>%
group_by(addr) %>%
summarise(across(everything(),
~ if(all(is.na(.))) NA_real_ else .[!is.na(.)]), .groups = 'drop')
Output:
# addr x num
# <chr> <dbl> <dbl>
#1 a 1 1
#2 b NA 2
#3 c 4 3
#4 d 5 4
#5 e 5 100
#6 f 7 200
#7 g NA 500
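For anyone who wants to stay in base R, the same two steps can be sketched with Reduce and aggregate. This is only a sketch under one assumption that goes beyond the answer above: when an 'addr' has several non-NA values in a column, the first one is kept and the rest are silently dropped. The helper name first_non_na is made up for illustration.

```r
df1 <- data.frame(addr = c("a", "b", "c", "d"),
                  num = c(1, 2, 3, 4),
                  x = c(1, NA, 4, 5))
df2 <- data.frame(addr = c("e", "f", "g"), num = c(100, 200, 500))
df3 <- data.frame(addr = c("e", "f", "g"), x = c(5, 7, NA))

# Step 1: merge the data sets one after another on their shared columns
out <- Reduce(function(a, b) {
  merge(a, b, by = intersect(names(a), names(b)), all = TRUE)
}, list(df1, df2, df3))

# Step 2: collapse duplicated addr rows.
# Hypothetical helper: first non-missing value, or NA of the same type
# when every value is missing (indexing an empty vector with [1] gives NA).
first_non_na <- function(v) v[!is.na(v)][1]

collapsed <- aggregate(out[c("x", "num")], by = out["addr"], FUN = first_non_na)
collapsed
```

The data frame method of aggregate only drops rows with missing values in the grouping column, so the NA cells in x and num reach first_non_na intact.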

From axis values to coordinate pairs [duplicate]

I have two vectors of integers, say v1=c(1,2) and v2=c(3,4). I want to combine them and obtain the following result (as a data.frame or matrix):
> combine(v1,v2) <--- doesn't exist
1 3
1 4
2 3
2 4
This is the basic case. What about something a little more complicated: combining every row of one table with every row of another? E.g. imagine that we have two data.frames or matrices, d1 and d2, and we want to combine them to obtain the following result:
d1
1 13
2 11
d2
3 12
4 10
> combine(d1,d2) <--- doesn't exist
1 13 3 12
1 13 4 10
2 11 3 12
2 11 4 10
How could I achieve this?
For the simple case of vectors there is expand.grid
v1 <- 1:2
v2 <- 3:4
expand.grid(v1, v2)
# Var1 Var2
#1 1 3
#2 2 3
#3 1 4
#4 2 4
I don't know of a function that does this automatically for data frames (see the edit below), but it is relatively easy to accomplish with expand.grid and cbind.
df1 <- data.frame(a = 1:2, b=3:4)
df2 <- data.frame(cat = 5:6, dog = c("a","b"))
expand.grid(df1, df2) # doesn't work so let's try something else
id <- expand.grid(seq(nrow(df1)), seq(nrow(df2)))
out <-cbind(df1[id[,1],], df2[id[,2],])
out
# a b cat dog
#1 1 3 5 a
#2 2 4 5 a
#1.1 1 3 6 b
#2.1 2 4 6 b
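Edit: As Joran points out in the comments, merge does this for data frames. When the two data frames have no column names in common, merge returns the Cartesian product of their rows.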
df1 <- data.frame(a = 1:2, b=3:4)
df2 <- data.frame(cat = 5:6, dog = c("a","b"))
merge(df1, df2)
# a b cat dog
#1 1 3 5 a
#2 2 4 5 a
#3 1 3 6 b
#4 2 4 6 b
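One caveat worth sketching: the plain merge call only gives the cross product when the two data frames share no column names. If they might share names, passing by = NULL asks for the Cartesian product explicitly, and merge then disambiguates the clashing names with its default .x and .y suffixes. The column names a and b below are made up; the values are the d1 and d2 from the question.

```r
# d1 and d2 from the question, deliberately given the same column names
d1 <- data.frame(a = 1:2, b = c(13, 11))
d2 <- data.frame(a = 3:4, b = c(12, 10))

# merge(d1, d2) would try to match rows on a and b here;
# by = NULL requests every combination of rows instead
res <- merge(d1, d2, by = NULL)
res
```

As in the expand.grid version, the rows of the first data frame vary fastest.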

Reshape data frame by vector

Let's say a function, textstat_frequency from the quanteda package, gives us the following data frame.
data.frame(xx=1:4,yy=5:8,foo=c("A","A","B","C"),stringsAsFactors=FALSE)
xx yy foo
1 1 5 A
2 2 6 A
3 3 7 B
4 4 8 C
What's the best way to reorder the rows of the data.frame according to the vector c("B","A","C")? I have tried building an index with match or %in%, but without any luck.
df = data.frame(xx=1:4,yy=5:8,foo=c("A","A","B","C"),stringsAsFactors=FALSE)
temp = factor(df$foo, levels = c("B", "A", "C"))
df = df[order(temp),]
df
# xx yy foo
#3 3 7 B
#1 1 5 A
#2 2 6 A
#4 4 8 C
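Since the question mentions match, here is a sketch of the same reordering without a factor. match gives each row the position of its foo value in the target vector, and order sorts the rows by that position; the name target is only for illustration. Values of foo that are missing from target get NA and end up last.

```r
df <- data.frame(xx = 1:4, yy = 5:8, foo = c("A", "A", "B", "C"),
                 stringsAsFactors = FALSE)
target <- c("B", "A", "C")

# match() returns 2 2 1 3 here; order() turns that into row order 3 1 2 4
reordered <- df[order(match(df$foo, target)), ]
reordered
```

Ties (the two "A" rows) keep their original relative order, so the result matches the factor-based answer.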

How to use stack to produce multiple columns data frame?

I want to convert a list of lists into a data.frame. At first each sublist was only of length 1, so I used stack(as.data.frame(...)), but stack does not seem to be able to produce a multi-column data.frame. So what is the best way to achieve that?
# works fine with only sublists of length 1
l = list(a = sample(1:5, 5), b = sample(1:5, 5))
> stack(as.data.frame(l))
values ind
1 5 a
2 4 a
3 1 a
4 2 a
5 3 a
6 2 b
7 1 b
8 3 b
9 5 b
10 4 b
Now my list is a list of lists:
l = list(a = list(first = sample(1:5, 5), sec = sample(1:5, 5)), b = list(first = sample(1:5, 5), sec = sample(1:5, 5)))
stack(as.data.frame(l))
values ind
1 4 a.first
2 5 a.first
3 3 a.first
4 1 a.first
5 2 a.first
6 3 a.sec
7 5 a.sec
8 1 a.sec
9 2 a.sec
10 4 a.sec
11 5 b.first
12 4 b.first
13 3 b.first
14 1 b.first
15 2 b.first
16 3 b.sec
17 4 b.sec
18 1 b.sec
19 2 b.sec
20 5 b.sec
whereas I'd like to still have a column ind with a and b, plus two columns first and sec.
We can flatten the list by concatenating (c) the nested elements into 'l1', whose names look like a.first. From those names we take two substrings: 'nm1', the part after the dot (first, sec), and 'nm2', the part before the dot (a, b). We rename 'l1' with 'nm2', split it by 'nm1', and stack each piece ('lst'). Finally, we cbind the 'ind' column (which is the same in every element of 'lst', so we take it from the first one, lst[[1]][2]) with the 'values' column, i.e. the first column, of each element.
l1 <- do.call(c, l)
nm1 <- sub("[^.]+\\.", "", names(l1))
nm2 <- sub("\\..*", "", names(l1))
lst <- lapply(split(setNames(l1, nm2), nm1), stack)
cbind(lst[[1]][2],lapply(lst, `[[`, 1))
# ind first sec
#1 a 1 1
#2 a 5 5
#3 a 4 4
#4 a 3 3
#5 a 2 2
#6 b 3 4
#7 b 4 5
#8 b 2 2
#9 b 1 3
#10 b 5 1
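Or using dplyr/purrr we can get the expected output. (The output below was printed by an older dplyr; newer versions of bind_cols repair duplicated column names differently, so the select step may need adjusting.)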
library(purrr)
library(dplyr)
l1 <- transpose(l)
n1 <- names(l1)
l1 %>%
map(stack) %>%
bind_cols %>%
setNames(., make.unique(names(.))) %>%
select(ind, matches("value")) %>%
setNames(., c("ind", n1))
# ind first sec
# (fctr) (int) (int)
#1 a 1 1
#2 a 5 5
#3 a 4 4
#4 a 3 3
#5 a 2 2
#6 b 3 4
#7 b 4 5
#8 b 2 2
#9 b 1 3
#10 b 5 1
Here is another approach:
df <- stack(as.data.frame(l))
# split names of variables
indVars <- strsplit(as.character(df$ind), split="\\.")
# add variables to data.frame
df$letters <- sapply(indVars, function(i) i[1])
df$order <- sapply(indVars, function(i) i[2])
# get final data.frame
cbind("order"=unstack(df, letters~order)[,1], unstack(df, values~order))
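A more direct base R route, offered only as a sketch: build one small data frame per top-level element and rbind them. It assumes every sublist has the same names (first, sec) and that the vectors inside a sublist all have the same length; the name pieces is made up for illustration.

```r
set.seed(1)  # only so the example is reproducible
l <- list(a = list(first = sample(1:5, 5), sec = sample(1:5, 5)),
          b = list(first = sample(1:5, 5), sec = sample(1:5, 5)))

# One data frame per top-level name: the name is recycled into the 'ind'
# column and the sublist's vectors become the remaining columns
pieces <- lapply(names(l), function(nm) {
  data.frame(ind = nm, as.data.frame(l[[nm]]))
})

res <- do.call(rbind, pieces)
res
```

Because the columns come straight from the sublist names, this also works unchanged if the sublists gain a third or fourth element.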
