How can I concatenate data frames that contain one or more data.frames among their columns? For example:
df <- data.frame(a=1:3)
df$df <- data.frame(a=1:3)
rbind( df, df)
Error in row.names<-.data.frame(*tmp*, value = value) :
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘1’, ‘2’, ‘3’
library(dplyr)
bind_rows(list(df,df))
Error: Argument 2 can't be a list containing data frames
The issue here seems to be not the data.frame within a data frame as such, but the non-unique row names that end up in the nested column of the result. If the nested data frames have distinct row names, rbind works:
df1 <- data.frame(a=1:3)
df2 <- data.frame(a=1:3)
df1$df <- data.frame(a=1:3, row.names=letters[1:3])
df2$df <- data.frame(a=1:3, row.names=LETTERS[1:3])
> res <- rbind(df1, df2)
> res
a a
1 1 1
2 2 2
3 3 3
4 1 1
5 2 2
6 3 3
> res$df
a
a 1
b 2
c 3
A 1
B 2
C 3
The problem seems to be that rbind adjusts the rownames for the two data.frames being merged, but does not adjust the rownames for data.frames within data.frames.
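One way to sidestep the row name clash is to bind the ordinary columns and the nested data frames separately, then reattach the nested part. The following is only a sketch under that assumption; rbind_nested is a hypothetical helper, not a base R function, and it assumes a single nested column with the same name in both inputs.

```r
# Example data from the question: a data frame with a nested data frame column
df <- data.frame(a = 1:3)
df$df <- data.frame(a = 1:3)

# Hypothetical helper: bind the outer columns and the nested column separately
rbind_nested <- function(x, y, col = "df") {
  # bind everything except the nested column
  out <- rbind(x[setdiff(names(x), col)], y[setdiff(names(y), col)])
  # bind the nested data frames on their own and reset their row names
  nested <- rbind(x[[col]], y[[col]])
  rownames(nested) <- NULL
  # reattach the nested data frame as a single column
  out[[col]] <- nested
  out
}

res <- rbind_nested(df, df)
str(res)
```

Because the nested row names are reset before reattaching, the duplicate row.names error from the question should not be triggered.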
One option is to replicate the rows of df twice (or more) by indexing instead of rbind-ing it; this automatically creates non-duplicated row.names. Try this:
df[rep(seq_len(nrow(df)), 2), ]
# output
a a
1 1 1
2 2 2
3 3 3
1.1 1 1
2.1 2 2
3.1 3 3
The same approach with dplyr gives plain sequential row.names instead:
library(dplyr)
df %>% slice(rep(row_number(), 2))
# output
a a
1 1 1
2 2 2
3 3 3
4 1 1
5 2 2
6 3 3
We can put the data frames in a list, then use mapply to handle the two column types differently: stack for vectors and do.call(rbind) for data.frames.
L <- mget(ls(pattern="df\\.")) # or list(df.1, df.2, df.3)
res <- data.frame(a=stack(mapply(`[`, L, 1))[[1]])
res$df <- do.call(rbind, mapply(`[`, L, 2))
res
# a a
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 4
# 5 5 5
# 6 6 6
# 7 7 7
# 8 8 8
# 9 9 9
str(res)
# 'data.frame': 9 obs. of 2 variables:
# $ a : int 1 2 3 4 5 6 7 8 9
# $ df:'data.frame': 9 obs. of 1 variable:
# ..$ a: int 1 2 3 4 5 6 7 8 9
Data
df.1 <- structure(list(a = 1:3, df = structure(list(a = 1:3), class = "data.frame", row.names = c(NA,
-3L))), row.names = c(NA, -3L), class = "data.frame")
df.2 <- structure(list(a = 4:6, df = structure(list(a = 4:6), class = "data.frame", row.names = c(NA,
-3L))), row.names = c(NA, -3L), class = "data.frame")
df.3 <- structure(list(a = 7:9, df = structure(list(a = 7:9), class = "data.frame", row.names = c(NA,
-3L))), row.names = c(NA, -3L), class = "data.frame")
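The same column-by-column idea can be written without hard-coding column positions. This is a rough sketch, assuming every data frame in the list has the same columns in the same order; bind_columnwise is a made-up name for illustration.

```r
# Two small inputs shaped like df.1 and df.2 above
df.1 <- data.frame(a = 1:3); df.1$df <- data.frame(a = 1:3)
df.2 <- data.frame(a = 4:6); df.2$df <- data.frame(a = 4:6)

# Hypothetical helper: combine each column across the list by its type
bind_columnwise <- function(dfs) {
  cols <- names(dfs[[1]])
  parts <- lapply(cols, function(nm) {
    pieces <- lapply(dfs, `[[`, nm)
    if (is.data.frame(pieces[[1]])) {
      # nested data frames: row-bind them and reset row names
      out <- do.call(rbind, pieces)
      rownames(out) <- NULL
      out
    } else {
      # plain vectors: concatenate
      unlist(pieces)
    }
  })
  names(parts) <- cols
  # start from an empty data frame with the right number of rows
  res <- data.frame(row.names = seq_len(NROW(parts[[1]])))
  for (nm in cols) res[[nm]] <- parts[[nm]]
  res
}

res <- bind_columnwise(list(df.1, df.2))
str(res)
```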
Related
In R, I want to separate numbers that are in the same column. My data appear like this:
id time
1 1,2
2 3,4
3 4,5,6
I want it to appear like this:
1 1
1 2
2 3
2 4
3 4
3 5
3 6
Though not shown, there are different iterations of time that vary depending on the id. For example:
4 1,6,7
5 1,3,6
6 1,4,5
7 1,3,5
8 2,3,4
There are 100 ids and the time column has different #s that vary in order as shown above.
Does anyone have advice to do this?
An option with separate_rows
library(dplyr)
library(tidyr)
df %>%
separate_rows(time, sep = "(?<=.)(?=.)", convert = TRUE)
# A tibble: 4 x 2
# id time
# <dbl> <int>
#1 1 1
#2 1 2
#3 2 3
#4 2 4
data
df <- structure(list(id = c(1, 2), time = c(12, 34)), class = "data.frame",
row.names = c(NA,
-2L))
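The look-around pattern in the separate_rows call above matches the empty position between any two characters, so 12 becomes 1 and 2. A base R sketch of the same idea, assuming (as in that answer's data) that time is stored as a number without commas:

```r
df <- data.frame(id = c(1, 2), time = c(12, 34))

# splitting on "" breaks each value into its individual characters
digits <- strsplit(as.character(df$time), "")

# repeat each id once per digit and convert the digits back to integers
long <- data.frame(id = rep(df$id, lengths(digits)),
                   time = as.integer(unlist(digits)))
long
```

This only makes sense when every time point is a single digit; with comma-separated strings, splitting on the comma is the safer choice.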
Using tidyverse you could try the following. Make sure time is character type, and use strsplit to split it on the commas.
library(tidyverse)
df %>%
mutate(time = strsplit(as.character(time), ",")) %>%
unnest(cols = time)
Or you can just use separate_rows and indicate comma as separator:
df %>%
separate_rows(time, sep = ',')
Or in base R you could try this:
s <- strsplit(df$time, ',', fixed = TRUE)
# repeat each id once per split value; the split values become time
data.frame(id = rep(df$id, lengths(s)), time = unlist(s))
Output
# A tibble: 10 x 2
id time
<int> <chr>
1 1 1
2 1 2
3 2 3
4 2 4
5 3 4
6 3 5
7 3 6
8 4 1
9 4 6
10 4 7
Data
df <- structure(list(id = 1:4, time = c("1,2", "3,4", "4,5,6", "1,6,7"
)), class = "data.frame", row.names = c(NA, -4L))
I want to apply a filter_at over a list of dataframes. I can apply it to a single dataframe within this list like so:
dat_list[[1]] <- dat_list[[1]] %>% filter_at(vars(c("test", "x")), all_vars(!is.na(.)))
Here is the test dataset:
dat1 <- structure(list(id = 1:3, test = 4:6, x = 7:9), class = "data.frame", row.names = c(NA,-3L))
dat2 <- structure(list(id = 1:3, test = 4:6, x = 7:9), class = "data.frame", row.names = c(NA,-3L))
dat3 <- structure(list(id = 1:3, test = 4:6, x = 7:9), class = "data.frame", row.names = c(NA,-3L))
dat1[1,2] <- NA
dat1[1,3] <- NA
dat1[3,2] <- NA
dat1[3,3] <- NA
dat3[1,2] <- NA
dat3[1,3] <- NA
dat3[3,2] <- NA
dat3[3,3] <- NA
dat_list <- list(dat1, dat2, dat3)
Using tidyverse:
library(dplyr)
library(purrr)
dat_list2 <- map(dat_list, ~filter_at(., vars(c("test", "x")), all_vars(!is.na(.))))
dat_list2
#> [[1]]
#> id test x
#> 1 2 5 8
#>
#> [[2]]
#> id test x
#> 1 1 4 7
#> 2 2 5 8
#> 3 3 6 9
#>
#> [[3]]
#> id test x
#> 1 2 5 8
Created on 2020-07-08 by the reprex package (v0.3.0)
With dplyr 1.0.0, we can use filter with across
library(dplyr)#1.0.0
library(purrr)
dat_list %>%
map(~ .x %>% filter(across(c(test, x), ~ !is.na(.x))))
#[[1]]
# id test x
#1 2 5 8
#[[2]]
# id test x
#1 1 4 7
#2 2 5 8
#3 3 6 9
#[[3]]
# id test x
#1 2 5 8
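For comparison, the same filtering can be sketched in base R with lapply and complete.cases. The data below is a cut-down version of the test dataset, and keep_complete is a hypothetical helper name.

```r
dat1 <- data.frame(id = 1:3, test = c(NA, 5L, NA), x = c(NA, 8L, NA))
dat2 <- data.frame(id = 1:3, test = 4:6, x = 7:9)
dat_list <- list(dat1, dat2)

# keep only rows with no NA in the chosen columns
keep_complete <- function(d, cols) {
  d[complete.cases(d[, cols, drop = FALSE]), , drop = FALSE]
}

dat_list2 <- lapply(dat_list, keep_complete, cols = c("test", "x"))
dat_list2
```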
Let's say I have the following dfs
df1:
a b c d
1 2 3 4
4 3 3 4
9 7 3 4
df2:
a b c d
1 2 3 4
2 2 3 4
3 2 3 4
Now I want to merge both dfs conditional of column "a" to give me the following df
a b c d
1 2 3 4
4 3 3 4
9 7 3 4
2 2 3 4
3 2 3 4
In my dataset I tried using
merge <- merge(x = df1, y = df2, by = "a", all = TRUE)
However, df1 has 50,000 entries and df2 has 100,000 entries, and although there are definitely matching values in column a, the merged df has over one million entries. I do not understand this. As I understand it, there should be at most 150,000 entries in the merged df, and that only when no values in column a are equal between the two dfs.
I think what you want to do is not merge but rather rbind the two dataframes and remove the duplicated rows:
DATA:
df1 <- data.frame(a = c(1,4,9),
b = c(2,3,7),
c = c(3,3,3),
d = c(4,4,4))
df2 <- data.frame(a = c(1,2,3),
b = c(2,2,2),
c = c(3,3,3),
d = c(4,4,4))
SOLUTION:
Row-bind df1 and df2:
df3 <- rbind(df1, df2)
Remove the duplicate rows:
df3 <- df3[!duplicated(df3), ]
RESULT:
df3
a b c d
1 1 2 3 4
2 4 3 3 4
3 9 7 3 4
5 2 2 3 4
6 3 2 3 4
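As for why merge returned more than 150,000 rows: when a key value occurs several times in both data frames, merge returns every pairing of the matching rows, so repeated values in column a multiply. A small sketch with made-up data (x and y are illustrative, not the asker's data):

```r
x <- data.frame(a = c(1, 1, 1, 2), b = 1:4)   # key 1 appears three times
y <- data.frame(a = c(1, 1, 3), c = 5:7)      # key 1 appears twice

m <- merge(x, y, by = "a", all = TRUE)

# key 1 contributes 3 * 2 rows, keys 2 and 3 contribute one row each
nrow(m)
nrow(x) + nrow(y)
```

Here the merged result has more rows than the two inputs combined, which is the same effect as in the question at a smaller scale.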
With tidyverse, we can do bind_rows and distinct
library(dplyr)
bind_rows(df1, df2) %>%
distinct
data
df1 <- structure(list(a = c(1, 4, 9), b = c(2, 3, 7), c = c(3, 3, 3),
d = c(4, 4, 4)), class = "data.frame", row.names = c(NA,
-3L))
df2 <- structure(list(a = c(1, 2, 3), b = c(2, 2, 2), c = c(3, 3, 3),
d = c(4, 4, 4)), class = "data.frame", row.names = c(NA,
-3L))
This can also be done with dplyr's union:
dplyr::union(df1, df2)
Here is another base R solution using rbind + %in%:
dfout <- rbind(df1,subset(df2,!a %in% df1$a))
such that
> rbind(df1,subset(df2,!a %in% df1$a))
a b c d
1 1 2 3 4
2 4 3 3 4
3 9 7 3 4
21 2 2 3 4
31 3 2 3 4
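Note that the %in% approach compares only column a, while duplicated compares whole rows, so the two can disagree when a matches but other columns differ. A small sketch with made-up data to show the difference:

```r
df1 <- data.frame(a = c(1, 4), b = c(2, 3))
df2 <- data.frame(a = c(1, 2), b = c(9, 2))   # a = 1 matches df1, but b differs

# key-based: drops the df2 row with a = 1 even though b is different
by_key <- rbind(df1, subset(df2, !a %in% df1$a))

# row-based: keeps it, because the full row is not a duplicate
by_row <- unique(rbind(df1, df2))

nrow(by_key)
nrow(by_row)
```

Which behaviour is wanted depends on whether a alone identifies a record.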
I have the following frame:
df <- structure(list(returns = list(c(1,2,3,4,5,6), c(7,8,9,10,11,12)), indexId = c("a", "b")), class = "data.frame", row.names = 1:2)
Is there an easy way to convert this into a normal data.frame so it appears as:
Choice ppl
1 a
2 a
3 a
4 a
5 a
6 a
7 b
8 b
9 b
10 b
11 b
12 b
I have a solution using For but I am looking for something simpler.
All help is much appreciated!
df <- structure(list(returns = list(c(1,2,3,4,5,6), c(7,8,9,10,11,12)),
indexId = c("a", "b")), class = "data.frame", row.names = 1:2)
library(tidyverse)
df %>% unnest(cols = returns)
# returns indexId
# 1 1 a
# 2 2 a
# 3 3 a
# 4 4 a
# 5 5 a
# 6 6 a
# 7 7 b
# 8 8 b
# 9 9 b
# 10 10 b
# 11 11 b
# 12 12 b
Or :
data.frame(choice = unlist(df$returns), ppl = rep(df$indexId, lengths(df$returns)))
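Another base R possibility is stack, which turns a named list into a two-column data frame. This is a sketch that assumes the Choice and ppl column names from the question are wanted.

```r
df <- structure(list(returns = list(c(1, 2, 3, 4, 5, 6), c(7, 8, 9, 10, 11, 12)),
                     indexId = c("a", "b")),
                class = "data.frame", row.names = 1:2)

# name each list element by its indexId, then stack into values/ind columns
long <- stack(setNames(df$returns, df$indexId))

# rename to the requested column names; ind comes back as a factor
names(long) <- c("Choice", "ppl")
long$ppl <- as.character(long$ppl)
long
```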
I'm trying to find an elegant way to work with list structures in R. In particular, in this case, I'd like to extract sub-elements from a list, modify them based on their associated data in that list, and concatenate them into a data frame. Perhaps easier with an example:
mystruct <- structure(list(dataset1 = structure(list(data1 = structure(list(
a = c(1, 2, 3), b = c(4, 5, 6)), .Names = c("a", "b"), row.names = c(NA,
-3L), class = "data.frame"), data2 = c("a", "b", "c", "d", "e"
)), .Names = c("data1", "data2")), dataset2 = structure(list(
data1 = structure(list(a = c(7, 8, 9), b = c(10, 11, 12)), .Names = c("a",
"b"), row.names = c(NA, -3L), class = "data.frame"), data2 = c("f",
"g", "h", "i", "j")), .Names = c("data1", "data2"))), .Names = c("dataset1",
"dataset2"))
I can concatenate data1 elements like this:
> mystruct %>% map_dfr(~.x$data1)
a b
1 1 4
2 2 5
3 3 6
4 7 10
5 8 11
6 9 12
But I would like to add a "dataset" column, which is populated by the name of the list element from whence the data was taken:
dataset a b
1 dataset1 1 4
2 dataset1 2 5
3 dataset1 3 6
4 dataset2 7 10
5 dataset2 8 11
6 dataset2 9 12
Is there a way to do this nicely with the tidyverse? I'd also be open to data.table solutions.
Thanks,
Allie
Provide an .id parameter to map_df, which will create a column giving the name of the list:
map_df(mystruct, 'data1', .id='dataset')
# dataset a b
#1 dataset1 1 4
#2 dataset1 2 5
#3 dataset1 3 6
#4 dataset2 7 10
#5 dataset2 8 11
#6 dataset2 9 12
Or map_dfr should work as well:
map_dfr(mystruct, 'data1', .id='dataset')
map_dfr has an .id argument:
mystruct %>% map_dfr(~ .x$data1, .id = "id")
giving:
id a b
1 dataset1 1 4
2 dataset1 2 5
3 dataset1 3 6
4 dataset2 7 10
5 dataset2 8 11
6 dataset2 9 12
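If purrr is not available, roughly the same result can be sketched in base R with Map and rbind. The list below is a simplified reconstruction of mystruct, and the column name dataset is taken from the question.

```r
mystruct <- list(
  dataset1 = list(data1 = data.frame(a = c(1, 2, 3), b = c(4, 5, 6)),
                  data2 = c("a", "b", "c", "d", "e")),
  dataset2 = list(data1 = data.frame(a = c(7, 8, 9), b = c(10, 11, 12)),
                  data2 = c("f", "g", "h", "i", "j"))
)

# prepend the list element's name to each data1 as a 'dataset' column
pieces <- Map(function(el, nm) {
  data.frame(dataset = nm, el$data1, stringsAsFactors = FALSE)
}, mystruct, names(mystruct))

# row-bind the pieces and reset the row names
res <- do.call(rbind, unname(pieces))
rownames(res) <- NULL
res
```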
Restructure as a "tidy" table with list columns...
library(data.table)
tabstruct = rbindlist(lapply(mystruct, lapply, list), idcol = TRUE)
# .id data1 data2
# 1: dataset1 <data.frame> a,b,c,d,e
# 2: dataset2 <data.frame> f,g,h,i,j
Then "unnest" data1:
tabstruct[, rbindlist(setNames(data1, .id), idcol = TRUE)]
# .id a b
# 1: dataset1 1 4
# 2: dataset1 2 5
# 3: dataset1 3 6
# 4: dataset2 7 10
# 5: dataset2 8 11
# 6: dataset2 9 12
Or unnest data2:
tabstruct[, .(val = unlist(data2)), by=.id]
# .id val
# 1: dataset1 a
# 2: dataset1 b
# 3: dataset1 c
# 4: dataset1 d
# 5: dataset1 e
# 6: dataset2 f
# 7: dataset2 g
# 8: dataset2 h
# 9: dataset2 i
# 10: dataset2 j
Here is an option to do this on multiple datasets in the list
map(c('data1', 'data2'), ~
map2_df(mystruct, .x, ~ .x[[.y]], .id = 'id'))
#[[1]]
# id a b
#1 dataset1 1 4
#2 dataset1 2 5
#3 dataset1 3 6
#4 dataset2 7 10
#5 dataset2 8 11
#6 dataset2 9 12
#[[2]]
# A tibble: 5 x 3
# id dataset1 dataset2
# <chr> <chr> <chr>
#1 1 a f
#2 1 b g
#3 1 c h
#4 1 d i
#5 1 e j