Flatten deeply nested list of dataframes - r

Consider this nested list of dataframes:
df <- data.frame(x = 1:5, y = letters[1:5])
l <- list(df, list(df, df), list(df, list(df, df, list(df))), list(df), df)
How can one get from this deeply nested list to a simple list of dataframes:
list(df, df, df, df, df, df, df, df, df)
The usual solutions (like here) fail to preserve the data frames' structure.
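To illustrate the problem, here is a minimal sketch (assuming the "usual solutions" are variations of `unlist()`, which the question does not state explicitly). A data frame is itself a list of columns, so `unlist()` cannot tell it apart from a nesting level:

```r
df <- data.frame(x = 1:5, y = letters[1:5])
l <- list(df, list(df, df), list(df, list(df, df, list(df))), list(df), df)

# Fully recursive: every column of every data frame is melted into one
# atomic vector, so the data frames are gone.
all_values <- unlist(l)
is.atomic(all_values)

# One level only: top-level data frames are split into their columns,
# while deeper lists are only partly unwrapped.
one_level <- unlist(l, recursive = FALSE)
vapply(one_level, is.data.frame, logical(1))
```

Neither call returns a flat list in which every element is a data frame.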

A convenient option is to use rrapply:
rrapply::rrapply(l, classes = "data.frame", how = "flatten")
Check whether it's the same as the desired output:
identical(list(df, df, df, df, df, df, df, df, df),
          rrapply::rrapply(l, classes = "data.frame", how = "flatten"))
[1] TRUE

Or using a base R recursive function:
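unnestdf <- function(x)
{
  # a data frame is a leaf: wrap it so unlist() does not split it into columns
  if (is.data.frame(x))
    return(list(x))
  # anything that is neither a data frame nor a list is dropped
  if (!is.list(x))
    return(NULL)
  # recurse into the list and splice the results one level at a time
  unlist(lapply(x, unnestdf), recursive = FALSE)
}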
unnestdf(l)
#> [[1]]
#> x y
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 4 d
#> 5 5 e
#>
#> [[2]]
#> x y
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 4 d
#> 5 5 e
#>
#> [[3]]
#> x y
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 4 d
#> 5 5 e
#>
#> [[4]]
#> x y
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 4 d
#> 5 5 e
#>
#> [[5]]
#> x y
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 4 d
#> 5 5 e
#>
#> [[6]]
#> x y
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 4 d
#> 5 5 e
#>
#> [[7]]
#> x y
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 4 d
#> 5 5 e
#>
#> [[8]]
#> x y
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 4 d
#> 5 5 e
#>
#> [[9]]
#> x y
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 4 d
#> 5 5 e
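For very deep nesting, the same idea can also be written without recursion. The following is only a sketch under the assumption that the list contains nothing but data frames and plain lists; `flatten_iter` is a hypothetical helper name, not part of base R or rrapply. Unlike `unnestdf()`, it keeps other non-list elements rather than dropping them.

```r
df <- data.frame(x = 1:5, y = letters[1:5])
l <- list(df, list(df, df), list(df, list(df, df, list(df))), list(df), df)

flatten_iter <- function(x) {
  # a "nested" element is a plain list, i.e. a list that is not a data frame
  is_nested <- function(e) is.list(e) && !is.data.frame(e)
  # peel off one nesting level per pass until only leaves remain
  while (any(vapply(x, is_nested, logical(1)))) {
    x <- do.call(c, lapply(x, function(e) if (is_nested(e)) e else list(e)))
  }
  x
}

flat <- flatten_iter(l)
length(flat)
```

Each pass wraps the leaves in `list()` so that `c()` splices only the plain lists, which keeps the data frames intact and in their original order.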

Related

R + dplyr: Partial Deduplication of Rows in a Tibble

A very common question is how to remove all the duplicated lines in a data frame in R, something which can be done with a variety of tools (I like dplyr+distinct).
However, what if your dataset contains several duplicated lines, but you do not want to remove all of them, but only those for some combination of the variables?
I do not know how to achieve that, so any suggestion is welcome.
Please have a look at the reprex at the end of the post.
Thanks!
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- tibble(x=rep(seq(5), 3), y=rep(LETTERS[1:5],3),
z=c(rep(c("h","j","k","t","u"), 2), LETTERS[1:5])
)
df
#> # A tibble: 15 × 3
#> x y z
#> <int> <chr> <chr>
#> 1 1 A h
#> 2 2 B j
#> 3 3 C k
#> 4 4 D t
#> 5 5 E u
#> 6 1 A h
#> 7 2 B j
#> 8 3 C k
#> 9 4 D t
#> 10 5 E u
#> 11 1 A A
#> 12 2 B B
#> 13 3 C C
#> 14 4 D D
#> 15 5 E E
df_ded <- df |>
distinct()
df_ded
#> # A tibble: 10 × 3
#> x y z
#> <int> <chr> <chr>
#> 1 1 A h
#> 2 2 B j
#> 3 3 C k
#> 4 4 D t
#> 5 5 E u
#> 6 1 A A
#> 7 2 B B
#> 8 3 C C
#> 9 4 D D
#> 10 5 E E
## I want to deduplicate only the rows with x==3 and z=="k"
df_ded_partial <- df |>
distinct(x==3, z=="k") ## but this is not what I mean.
## How to achieve it?
df_ded_partial
#> # A tibble: 3 × 2
#> `x == 3` `z == "k"`
#> <lgl> <lgl>
#> 1 FALSE FALSE
#> 2 TRUE TRUE
#> 3 TRUE FALSE
Created on 2023-02-14 with reprex v2.0.2
We can use group_modify() and check the condition via the .y argument, which is a one-row tibble holding the keys of the current group. If the condition is met, return distinct(.x) for that group; otherwise return the whole group .x unchanged.
library(dplyr)
df |>
group_by(x, z) |>
group_modify(~ if(.y$x == 3 && .y$z == "k") distinct(.x) else .x)
#> # A tibble: 14 x 3
#> # Groups: x, z [10]
#> x z y
#> <int> <chr> <chr>
#> 1 1 A A
#> 2 1 h A
#> 3 1 h A
#> 4 2 B B
#> 5 2 j B
#> 6 2 j B
#> 7 3 C C
#> 8 3 k C
#> 9 4 D D
#> 10 4 t D
#> 11 4 t D
#> 12 5 E E
#> 13 5 u E
#> 14 5 u E
Data from OP
df <- tibble(x=rep(seq(5), 3), y=rep(LETTERS[1:5],3),
z=c(rep(c("h","j","k","t","u"), 2), LETTERS[1:5])
)
Created on 2023-02-14 by the reprex package (v2.0.1)
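An alternative that keeps the original row order and column order is to build a logical mask. This is a base R sketch, not part of the answer above; it uses a plain data.frame instead of a tibble to stay dependency-free, and the names `is_dup` and `is_target` are made up for illustration.

```r
df <- data.frame(
  x = rep(seq(5), 3),
  y = rep(LETTERS[1:5], 3),
  z = c(rep(c("h", "j", "k", "t", "u"), 2), LETTERS[1:5])
)

# TRUE for every row that repeats an earlier row in full
is_dup <- duplicated(df)
# TRUE for the rows that are allowed to be deduplicated
is_target <- df$x == 3 & df$z == "k"

# drop a row only if it is both a duplicate and a target row
df_ded_partial <- df[!(is_dup & is_target), ]
nrow(df_ded_partial)
```

Rows outside the target condition are left alone even when they are duplicated.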

R: How do I add a column with values to a table, for every row in said table?

For example, if I have two data frames:
x <- data.frame(c('a', 'b', 'c'))
y <- data.frame(c('1', '2', '3'))
I want my output to look like:
x y
a 1
a 2
a 3
b 1
b 2
b 3
c 1
c 2
c 3
Sadly, I have no idea what such an operation is called or where to start. Could anyone help me with a solution? Thanks!
Here are a few options:
library(tidyverse)
x <- data.frame(x = c('a', 'b', 'c'))
y <- data.frame(y = c('1', '2', '3'))
#option 1
expand.grid(x = x$x, y = y$y) |>
arrange(x)
#> x y
#> 1 a 1
#> 2 a 2
#> 3 a 3
#> 4 b 1
#> 5 b 2
#> 6 b 3
#> 7 c 1
#> 8 c 2
#> 9 c 3
#option 2
map_dfr(x$x, ~tibble(x = .x, y = y$y))
#> # A tibble: 9 x 2
#> x y
#> <chr> <chr>
#> 1 a 1
#> 2 a 2
#> 3 a 3
#> 4 b 1
#> 5 b 2
#> 6 b 3
#> 7 c 1
#> 8 c 2
#> 9 c 3
#option 3 (dplyr >= 1.1.0 deprecates by = character(); use cross_join(x, y) there)
full_join(x, y, by = character())
#> x y
#> 1 a 1
#> 2 a 2
#> 3 a 3
#> 4 b 1
#> 5 b 2
#> 6 b 3
#> 7 c 1
#> 8 c 2
#> 9 c 3
Using rep to repeat elements individually, then put them in a data frame.
data.frame(x = rep(x[, 1], each=nrow(y)), y = y[, 1])
x y
1 a 1
2 a 2
3 a 3
4 b 1
5 b 2
6 b 3
7 c 1
8 c 2
9 c 3
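The operation asked about is commonly called a cross join or Cartesian product. As a further base R sketch (assuming the columns are named `x` and `y` as in the answers above), `merge()` with `by = NULL` produces it directly:

```r
x <- data.frame(x = c("a", "b", "c"))
y <- data.frame(y = c("1", "2", "3"))

# by = NULL means there are no key columns, so every row of x is paired
# with every row of y
crossed <- merge(x, y, by = NULL)

# sort so that all rows for "a" come first, then "b", then "c"
crossed <- crossed[order(crossed$x, crossed$y), ]
rownames(crossed) <- NULL
crossed
```

The explicit `order()` step is there because the example does not rely on the row order that `merge()` returns.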

Keep Top Factors based on Grouped Weight

Please have a look at the snippet at the end of the post.
I am essentially looking for a cleaner way to obtain the same result.
I have a tibble where the x column is a character vector (I did not translate it into a factor, but this is actually what it is).
Each factor appears multiple times and it always has an associated numerical value (the w column in the tibble).
I would like to keep the top 4 factors according to the sum of their associated w values and change everything else into an "other" factor.
I achieve it below, but I wonder if there is a smarter way to do the same using e.g. forcats.
Any suggestion is appreciated.
library(tidyverse)
df <- tibble(x=rep(letters[1:10], 10), w=seq(100))
df
#> # A tibble: 100 × 2
#> x w
#> <chr> <int>
#> 1 a 1
#> 2 b 2
#> 3 c 3
#> 4 d 4
#> 5 e 5
#> 6 f 6
#> 7 g 7
#> 8 h 8
#> 9 i 9
#> 10 j 10
#> # … with 90 more rows
###detect the first 4 factors based on the w column
ff <- df |>
group_by(x) |>
summarise(w_tot=sum(w)) |>
ungroup() |>
arrange(desc(w_tot)) |>
slice(1:4) |>
pull(x)
ff
#> [1] "j" "i" "h" "g"
## recode the data
df_new <- df |>
mutate(w=if_else(x %in% ff, x, "other"))
df_new
#> # A tibble: 100 × 2
#> x w
#> <chr> <chr>
#> 1 a other
#> 2 b other
#> 3 c other
#> 4 d other
#> 5 e other
#> 6 f other
#> 7 g g
#> 8 h h
#> 9 i i
#> 10 j j
#> # … with 90 more rows
Created on 2022-09-16 by the reprex package (v2.0.1)
It appears that I can pass a weight argument to fct_lump_n(), so this works:
library(tidyverse)
library(forcats)
df <- tibble(x=rep(letters[1:10], 10), w=seq(100))
df
#> # A tibble: 100 × 2
#> x w
#> <chr> <int>
#> 1 a 1
#> 2 b 2
#> 3 c 3
#> 4 d 4
#> 5 e 5
#> 6 f 6
#> 7 g 7
#> 8 h 8
#> 9 i 9
#> 10 j 10
#> # … with 90 more rows
###detect the first 4 factors based on the w column
ff <- df |>
group_by(x) |>
summarise(w_tot=sum(w)) |>
ungroup() |>
arrange(desc(w_tot)) |>
slice(1:4) |>
pull(x)
ff
#> [1] "j" "i" "h" "g"
## recode the data
df_new <- df |>
mutate(w=if_else(x %in% ff, x, "other"))
df_new
#> # A tibble: 100 × 2
#> x w
#> <chr> <chr>
#> 1 a other
#> 2 b other
#> 3 c other
#> 4 d other
#> 5 e other
#> 6 f other
#> 7 g g
#> 8 h h
#> 9 i i
#> 10 j j
#> # … with 90 more rows
df_new2 <- df |>
mutate(x2 = fct_lump_n(x, n = 4, w = w))
df_new2
#> # A tibble: 100 × 3
#> x w x2
#> <chr> <int> <fct>
#> 1 a 1 Other
#> 2 b 2 Other
#> 3 c 3 Other
#> 4 d 4 Other
#> 5 e 5 Other
#> 6 f 6 Other
#> 7 g 7 g
#> 8 h 8 h
#> 9 i 9 i
#> 10 j 10 j
#> # … with 90 more rows
Created on 2022-09-16 by the reprex package (v2.0.1)
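For comparison, the same weighted lumping can be sketched in base R. This is an illustration only: the names `totals`, `top`, and `x2` are made up here, and the result is a character column rather than the factor that fct_lump_n() returns.

```r
df <- data.frame(x = rep(letters[1:10], 10), w = seq(100),
                 stringsAsFactors = FALSE)

# total weight per level of x
totals <- tapply(df$w, df$x, sum)

# names of the 4 levels with the largest total weight
top <- names(totals)[order(totals, decreasing = TRUE)][1:4]

# keep the top levels, lump everything else together
df$x2 <- ifelse(df$x %in% top, df$x, "Other")
table(df$x2)
```

Ties in the totals are not handled specially in this sketch; `order()` simply keeps the first levels it encounters.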

How can I change the label of row structure into a string?

I'm trying to change the row labels (the ones marked with the red rectangle) into strings (character). Any ideas or suggestions on how I can change them?
Set the rownames() for the data.frame. You might also find the rownames_to_column(), rowid_to_column(), and column_to_rownames() functions from the {tibble} package useful:
dat <- data.frame(x = 1:26)
head(dat)
#> x
#> 1 1
#> 2 2
#> 3 3
#> 4 4
#> 5 5
#> 6 6
rownames(dat) <- letters
head(dat)
#> x
#> a 1
#> b 2
#> c 3
#> d 4
#> e 5
#> f 6
tibble::rownames_to_column(dat, var = "rowname") |>
head()
#> rowname x
#> 1 a 1
#> 2 b 2
#> 3 c 3
#> 4 d 4
#> 5 e 5
#> 6 f 6
tibble::rowid_to_column(dat, var = "rowid") |>
head()
#> rowid x
#> 1 1 1
#> 2 2 2
#> 3 3 3
#> 4 4 4
#> 5 5 5
#> 6 6 6
dat <- data.frame(x = 1:26, rowname = letters)
head(dat)
#> x rowname
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 4 d
#> 5 5 e
#> 6 6 f
tibble::column_to_rownames(dat, var = "rowname") |>
head()
#> x
#> a 1
#> b 2
#> c 3
#> d 4
#> e 5
#> f 6
Created on 2022-07-22 by the reprex package (v2.0.1)
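If adding {tibble} is not an option, the same conversion can be sketched in base R. The column name `rowname` is an arbitrary choice for this example. Note that `rownames()` already returns a character vector, even for automatic row numbers.

```r
dat <- data.frame(x = 1:26)
rownames(dat) <- letters

# row names are stored as character strings
is.character(rownames(dat))

# copy the row names into an ordinary column, then reset them
dat$rowname <- rownames(dat)
rownames(dat) <- NULL
head(dat)
```

After the reset, the row names fall back to the automatic sequence, and the labels live on as regular character data.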

R+dplyr: conditionally swap the elements of two columns

Consider the dataframe df at the end of the post.
I simply would like to swap the elements of columns x and y whenever x>y.
There may be other columns in the dataframe which I do not want to touch.
In a sense, I would like to sort row wise the columns x and y.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df<-tibble(x=1:10, y=10:1, extra=LETTERS[1:10])
df
#> # A tibble: 10 × 3
#> # Rowwise:
#> x y extra
#> <int> <int> <chr>
#> 1 1 10 A
#> 2 2 9 B
#> 3 3 8 C
#> 4 4 7 D
#> 5 5 6 E
#> 6 6 5 F
#> 7 7 4 G
#> 8 8 3 H
#> 9 9 2 I
#> 10 10 1 J
Created on 2021-10-06 by the reprex package (v2.0.1)
Base R solution:
Use which(df$x > df$y) to find the rows you want to change, then use rev to swap the values in those rows:
df[which(df$x > df$y), c("x", "y")] <- rev(df[which(df$x > df$y), c("x", "y")])
df
# x y extra
# <int> <int> <chr>
# 1 1 10 A
# 2 2 9 B
# 3 3 8 C
# 4 4 7 D
# 5 5 6 E
# 6 5 6 F
# 7 4 7 G
# 8 3 8 H
# 9 2 9 I
# 10 1 10 J
Thanks everyone!
I wrote a small function which does what I need and generalizes to the case of multiple variables.
See the reprex
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
set.seed(1234)
set_colnames <- `colnames<-`
df<-tibble(x=1:10, y=10:1, z=rnorm(10), extra=LETTERS[1:10]) %>%
rowwise()
df
#> # A tibble: 10 × 4
#> # Rowwise:
#> x y z extra
#> <int> <int> <dbl> <chr>
#> 1 1 10 -1.21 A
#> 2 2 9 0.277 B
#> 3 3 8 1.08 C
#> 4 4 7 -2.35 D
#> 5 5 6 0.429 E
#> 6 6 5 0.506 F
#> 7 7 4 -0.575 G
#> 8 8 3 -0.547 H
#> 9 9 2 -0.564 I
#> 10 10 1 -0.890 J
sort_rows <- function(df, col_names, dec = FALSE) {
  temp <- df %>%
    select(all_of(col_names))
  extra_names <- setdiff(colnames(df), col_names)
  temp2 <- df %>%
    select(all_of(extra_names))
  res <- t(apply(temp, 1, sort, decreasing = dec)) %>%
    as_tibble %>%
    set_colnames(col_names) %>%
    bind_cols(temp2)
  return(res)
}
col_names <- c("x", "y", "z")
df_s <- df %>%
sort_rows(col_names, dec=F)
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
#> Using compatibility `.name_repair`.
df_s
#> # A tibble: 10 × 4
#> x y z extra
#> <dbl> <dbl> <dbl> <chr>
#> 1 -1.21 1 10 A
#> 2 0.277 2 9 B
#> 3 1.08 3 8 C
#> 4 -2.35 4 7 D
#> 5 0.429 5 6 E
#> 6 0.506 5 6 F
#> 7 -0.575 4 7 G
#> 8 -0.547 3 8 H
#> 9 -0.564 2 9 I
#> 10 -0.890 1 10 J
Created on 2021-10-06 by the reprex package (v2.0.1)
This looks like sorting to me:
library(tidyverse)
df <- tibble(x=1:10, y=10:1, extra=LETTERS[1:10])
df
#> # A tibble: 10 x 3
#> x y extra
#> <int> <int> <chr>
#> 1 1 10 A
#> 2 2 9 B
#> 3 3 8 C
#> 4 4 7 D
#> 5 5 6 E
#> 6 6 5 F
#> 7 7 4 G
#> 8 8 3 H
#> 9 9 2 I
#> 10 10 1 J
extra_cols <- df %>% colnames() %>% setdiff(c("x", "y"))
extra_cols
#> [1] "extra"
df %>%
mutate(row = row_number()) %>%
pivot_longer(-c(row, extra_cols)) %>%
group_by_at(c("row", extra_cols)) %>%
transmute(
value = value %>% sort(),
name = c("x", "y"),
) %>%
pivot_wider() %>%
ungroup() %>%
select(-row)
#> Note: Using an external vector in selections is ambiguous.
#> ℹ Use `all_of(extra_cols)` instead of `extra_cols` to silence this message.
#> ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
#> This message is displayed once per session.
#> # A tibble: 10 x 3
#> extra x y
#> <chr> <int> <int>
#> 1 A 1 10
#> 2 B 2 9
#> 3 C 3 8
#> 4 D 4 7
#> 5 E 5 6
#> 6 F 5 6
#> 7 G 4 7
#> 8 H 3 8
#> 9 I 2 9
#> 10 J 1 10
Created on 2021-10-06 by the reprex package (v2.0.1)
Try using apply over the rows (MARGIN = 1) of the x and y columns, transpose the result with t, then use as_tibble to convert it to a tibble. Any other columns are dropped here.
Then finally change the column names:
> df <- as_tibble(t(apply(df[, c("x", "y")], 1, sort)))
> names(df) <- c('x', 'y')
> df
# A tibble: 10 x 2
x y
<int> <int>
1 1 10
2 2 9
3 3 8
4 4 7
5 5 6
6 5 6
7 4 7
8 3 8
9 2 9
10 1 10
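For exactly two columns, the conditional swap can also be sketched with the vectorised `pmin()` and `pmax()`. This is a base R illustration and not one of the answers above; `lo` and `hi` are made-up names, and the other columns are left untouched.

```r
df <- data.frame(x = 1:10, y = 10:1, extra = LETTERS[1:10])

# compute both results first, so that y is not derived from an
# already overwritten x
lo <- pmin(df$x, df$y)
hi <- pmax(df$x, df$y)

df$x <- lo
df$y <- hi
df
```

This only covers the two-column case; sorting across three or more columns needs a row-wise sort like the ones shown above.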
