How can I change the row labels of a data frame into strings?

I'm trying to change the type of the row labels (the ones marked with the red rectangle) into strings (character). Any ideas or suggestions on how I can do this?

Set the rownames() for the data.frame. You might also find the rownames_to_column(), rowid_to_column(), and column_to_rownames() functions from the {tibble} package useful:
dat <- data.frame(x = 1:26)
head(dat)
#> x
#> 1 1
#> 2 2
#> 3 3
#> 4 4
#> 5 5
#> 6 6
rownames(dat) <- letters
head(dat)
#> x
#> a 1
#> b 2
#> c 3
#> d 4
#> e 5
#> f 6
tibble::rownames_to_column(dat, var = "rowname") |>
head()
#> rowname x
#> 1 a 1
#> 2 b 2
#> 3 c 3
#> 4 d 4
#> 5 e 5
#> 6 f 6
tibble::rowid_to_column(dat, var = "rowid") |>
head()
#> rowid x
#> 1 1 1
#> 2 2 2
#> 3 3 3
#> 4 4 4
#> 5 5 5
#> 6 6 6
dat <- data.frame(x = 1:26, rowname = letters)
head(dat)
#> x rowname
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 4 d
#> 5 5 e
#> 6 6 f
tibble::column_to_rownames(dat, var = "rowname") |>
head()
#> x
#> a 1
#> b 2
#> c 3
#> d 4
#> e 5
#> f 6
Created on 2022-07-22 by the reprex package (v2.0.1)
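As a small base R sketch of the same idea, assuming the labels in question are the data frame's row names: row names are returned as character strings, so anything assigned through rownames<- is coerced to character. The ids vector below is a hypothetical stand-in for the real labels.

```r
dat <- data.frame(x = 1:3)

# hypothetical numeric labels; any atomic vector of unique values would do
ids <- c(101, 102, 103)
rownames(dat) <- ids

# the labels come back as character strings, not numbers
is.character(rownames(dat))
str(rownames(dat))

# rows can then be looked up by their string label
dat["102", "x"]
```

If the labels are needed as an ordinary character column instead, tibble::rownames_to_column() as shown above covers that case.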


R + dplyr: Partial Deduplication of Rows in a Tibble

A very common question is how to remove all the duplicated lines in a data frame in R, something which can be done with a variety of tools (I like dplyr+distinct).
However, what if your dataset contains several duplicated lines, but you do not want to remove all of them, but only those for some combination of the variables?
I do not know how to achieve that, so any suggestion is welcome.
Please have a look at the reprex at the end of the post.
Thanks!
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- tibble(x=rep(seq(5), 3), y=rep(LETTERS[1:5],3),
z=c(rep(c("h","j","k","t","u"), 2), LETTERS[1:5])
)
df
#> # A tibble: 15 × 3
#> x y z
#> <int> <chr> <chr>
#> 1 1 A h
#> 2 2 B j
#> 3 3 C k
#> 4 4 D t
#> 5 5 E u
#> 6 1 A h
#> 7 2 B j
#> 8 3 C k
#> 9 4 D t
#> 10 5 E u
#> 11 1 A A
#> 12 2 B B
#> 13 3 C C
#> 14 4 D D
#> 15 5 E E
df_ded <- df |>
distinct()
df_ded
#> # A tibble: 10 × 3
#> x y z
#> <int> <chr> <chr>
#> 1 1 A h
#> 2 2 B j
#> 3 3 C k
#> 4 4 D t
#> 5 5 E u
#> 6 1 A A
#> 7 2 B B
#> 8 3 C C
#> 9 4 D D
#> 10 5 E E
## I want to deduplicate only the rows with x==3 and z=="k"
df_ded_partial <- df |>
distinct(x==3, z=="k") ## but this is not what I mean.
## How to achieve it?
df_ded_partial
#> # A tibble: 3 × 2
#> `x == 3` `z == "k"`
#> <lgl> <lgl>
#> 1 FALSE FALSE
#> 2 TRUE TRUE
#> 3 TRUE FALSE
Created on 2023-02-14 with reprex v2.0.2
We can use group_modify() and check the condition on the .y argument, which is a one-row tibble holding the keys of the current group. If the condition is met, return distinct(.x) for that group; otherwise return the whole group .x unchanged.
library(dplyr)
df |>
group_by(x, z) |>
group_modify(~ if(.y$x == 3 && .y$z == "k") distinct(.x) else .x)
#> # A tibble: 14 x 3
#> # Groups: x, z [10]
#> x z y
#> <int> <chr> <chr>
#> 1 1 A A
#> 2 1 h A
#> 3 1 h A
#> 4 2 B B
#> 5 2 j B
#> 6 2 j B
#> 7 3 C C
#> 8 3 k C
#> 9 4 D D
#> 10 4 t D
#> 11 4 t D
#> 12 5 E E
#> 13 5 u E
#> 14 5 u E
Data from OP
df <- tibble(x=rep(seq(5), 3), y=rep(LETTERS[1:5],3),
z=c(rep(c("h","j","k","t","u"), 2), LETTERS[1:5])
)
Created on 2023-02-14 by the reprex package (v2.0.1)
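For comparison, here is a minimal base R sketch of the same partial deduplication, assuming "duplicate" means a row that repeats an earlier row in every column. It uses duplicated() and keeps the original row order; the names is_dup and drop are introduced only for this illustration.

```r
df <- data.frame(
  x = rep(1:5, 3),
  y = rep(LETTERS[1:5], 3),
  z = c(rep(c("h", "j", "k", "t", "u"), 2), LETTERS[1:5]),
  stringsAsFactors = FALSE
)

# TRUE for every row that repeats an earlier row in full
is_dup <- duplicated(df)

# drop a row only if it is a repeat AND it matches the condition
drop <- is_dup & df$x == 3 & df$z == "k"

df_ded_partial <- df[!drop, ]
nrow(df_ded_partial)
```

Unlike the grouped approach, this does not reorder the rows or move the grouping columns to the front.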

tidy syntax for matrix to tibble by index?

I have a matrix foo and want to create a data.frame or tibble like bar with the data in a long format with the indices as columns. What's a simple way to do this in the tidyverse?
z <- c(1,8,6,4,7,3,2,4,7)
foo <- matrix(z,3,3)
bar <- expand.grid(j=1:3,i=1:3)
bar$z <- z
foo
bar
Here are two ways.
The first is in fact a base R solution: just swap magrittr's pipe %>% for R's native pipe operator |>.
The second is a tidyverse solution which I find too complicated.
suppressPackageStartupMessages(
library(tidyverse)
)
z <- c(1,8,6,4,7,3,2,4,7)
foo <- matrix(z,3,3)
bar <- expand.grid(j=1:3,i=1:3)
bar$z <- z
cbind(
i = foo %>% row() %>% c(),
j = foo %>% col() %>% c(),
z = foo %>% c()
) %>%
as.data.frame()
#> i j z
#> 1 1 1 1
#> 2 2 1 8
#> 3 3 1 6
#> 4 1 2 4
#> 5 2 2 7
#> 6 3 2 3
#> 7 1 3 2
#> 8 2 3 4
#> 9 3 3 7
foo %>%
t() %>%
as.data.frame() %>%
pivot_longer(everything(), values_to = "z") %>%
mutate(i = c(row(foo)), j = c(col(foo))) %>%
select(-name) %>%
relocate(z, .after = j)
#> # A tibble: 9 × 3
#> i j z
#> <int> <int> <dbl>
#> 1 1 1 1
#> 2 2 1 8
#> 3 3 1 6
#> 4 1 2 4
#> 5 2 2 7
#> 6 3 2 3
#> 7 1 3 2
#> 8 2 3 4
#> 9 3 3 7
Created on 2022-10-12 with reprex v2.0.2
Another base R method would be to take advantage of as.table() and as.data.frame():
as.data.frame(lapply(as.data.frame(as.table(foo)), as.numeric),
col.names = c("row", "col", "val"))
#> row col val
#> 1 1 1 1
#> 2 2 1 8
#> 3 3 1 6
#> 4 1 2 4
#> 5 2 2 7
#> 6 3 2 3
#> 7 1 3 2
#> 8 2 3 4
#> 9 3 3 7
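All of the answers above rely on R storing matrices in column-major order. A short sketch that makes this explicit, assuming only base R, uses arrayInd() to turn each linear position into its (row, column) pair; idx and bar2 are names made up for this example.

```r
z <- c(1, 8, 6, 4, 7, 3, 2, 4, 7)
foo <- matrix(z, 3, 3)

# one row of (row index, column index) per element, in storage order
idx <- arrayInd(seq_along(foo), dim(foo))

bar2 <- data.frame(i = idx[, 1], j = idx[, 2], z = c(foo))
bar2
```

Because c(foo) walks down the columns, i cycles fastest and j changes every nrow(foo) elements, which matches the output shown above.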

Keep Top Factors based on Grouped Weight

Please have a look at the snippet at the end of the post.
I am essentially looking for a cleaner way to obtain the same result.
I have a tibble where the x column is a character vector (I did not translate it into a factor, but this is actually what it is).
Each factor appears multiple times and it always has an associated numerical value (the w column in the tibble).
I would like to keep the top 4 factors according to the sum of their associated w values and change everything else into an "other" factor.
I achieve it below, but I wonder if there is a smarter way to do the same using e.g. forcats.
Any suggestion is appreciated.
library(tidyverse)
df <- tibble(x=rep(letters[1:10], 10), w=seq(100))
df
#> # A tibble: 100 × 2
#> x w
#> <chr> <int>
#> 1 a 1
#> 2 b 2
#> 3 c 3
#> 4 d 4
#> 5 e 5
#> 6 f 6
#> 7 g 7
#> 8 h 8
#> 9 i 9
#> 10 j 10
#> # … with 90 more rows
###detect the first 4 factors based on the w column
ff <- df |>
group_by(x) |>
summarise(w_tot=sum(w)) |>
ungroup() |>
arrange(desc(w_tot)) |>
slice(1:4) |>
pull(x)
ff
#> [1] "j" "i" "h" "g"
## recode the data
df_new <- df |>
mutate(w=if_else(x %in% ff, x, "other"))
df_new
#> # A tibble: 100 × 2
#> x w
#> <chr> <chr>
#> 1 a other
#> 2 b other
#> 3 c other
#> 4 d other
#> 5 e other
#> 6 f other
#> 7 g g
#> 8 h h
#> 9 i i
#> 10 j j
#> # … with 90 more rows
Created on 2022-09-16 by the reprex package (v2.0.1)
It appears that I can pass a weight argument (w) to fct_lump_n(), so this works:
library(tidyverse)
library(forcats)
df <- tibble(x=rep(letters[1:10], 10), w=seq(100))
df
#> # A tibble: 100 × 2
#> x w
#> <chr> <int>
#> 1 a 1
#> 2 b 2
#> 3 c 3
#> 4 d 4
#> 5 e 5
#> 6 f 6
#> 7 g 7
#> 8 h 8
#> 9 i 9
#> 10 j 10
#> # … with 90 more rows
###detect the first 4 factors based on the w column
ff <- df |>
group_by(x) |>
summarise(w_tot=sum(w)) |>
ungroup() |>
arrange(desc(w_tot)) |>
slice(1:4) |>
pull(x)
ff
#> [1] "j" "i" "h" "g"
## recode the data
df_new <- df |>
mutate(w=if_else(x %in% ff, x, "other"))
df_new
#> # A tibble: 100 × 2
#> x w
#> <chr> <chr>
#> 1 a other
#> 2 b other
#> 3 c other
#> 4 d other
#> 5 e other
#> 6 f other
#> 7 g g
#> 8 h h
#> 9 i i
#> 10 j j
#> # … with 90 more rows
df_new2 <- df |>
mutate(x2 = fct_lump_n(x, n = 4, w = w))
df_new2
#> # A tibble: 100 × 3
#> x w x2
#> <chr> <int> <fct>
#> 1 a 1 Other
#> 2 b 2 Other
#> 3 c 3 Other
#> 4 d 4 Other
#> 5 e 5 Other
#> 6 f 6 Other
#> 7 g 7 g
#> 8 h 8 h
#> 9 i 9 i
#> 10 j 10 j
#> # … with 90 more rows
Created on 2022-09-16 by the reprex package (v2.0.1)
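To see roughly what the weighted lumping amounts to, here is a base R sketch of the same computation: sum w per level, keep the four levels with the largest totals, and recode the rest. It is an illustration of the idea rather than forcats' actual implementation; w_tot and top are names introduced here.

```r
df <- data.frame(x = rep(letters[1:10], 10), w = seq(100),
                 stringsAsFactors = FALSE)

# total weight per level of x
w_tot <- sapply(split(df$w, df$x), sum)

# the four levels with the largest total weight
top <- names(sort(w_tot, decreasing = TRUE))[1:4]
top

# keep the top levels, lump everything else together
df$x2 <- ifelse(df$x %in% top, df$x, "Other")
table(df$x2)
```

Note that this sketch returns a character column, whereas fct_lump_n() returns a factor.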

Flatten deeply nested list of dataframes

Consider this nested list of dataframes:
df <- data.frame(x = 1:5, y = letters[1:5])
l <- list(df, list(df, df), list(df, list(df, df, list(df))), list(df), df)
How can one get from this deeply nested list to a simple list of dataframes:
list(df, df, df, df, df, df, df, df, df)
Usual solutions (like here) fail to keep the dataframes' structure.
A convenient option is to use rrapply:
rrapply::rrapply(l, classes = "data.frame", how = "flatten")
Check whether it's the same as the desired output:
identical(list(df, df, df, df, df, df, df, df, df),
rrapply::rrapply(l, classes = "data.frame", how = "flatten"))
#> [1] TRUE
Or using a base R recursive function:
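unnestdf <- function(x)
{
  # a data frame is a leaf: wrap it so it survives the unlist() below
  if (is.data.frame(x))
    return(list(x))
  # anything else that is not a list is dropped
  if (!is.list(x))
    return(NULL)
  # recurse, then strip exactly one level of nesting
  unlist(lapply(x, unnestdf), recursive = FALSE)
}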
unnestdf(l)
#> [[1]]
#> x y
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 4 d
#> 5 5 e
#>
#> [[2]]
#> x y
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 4 d
#> 5 5 e
#>
#> [[3]]
#> x y
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 4 d
#> 5 5 e
#>
#> [[4]]
#> x y
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 4 d
#> 5 5 e
#>
#> [[5]]
#> x y
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 4 d
#> 5 5 e
#>
#> [[6]]
#> x y
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 4 d
#> 5 5 e
#>
#> [[7]]
#> x y
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 4 d
#> 5 5 e
#>
#> [[8]]
#> x y
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 4 d
#> 5 5 e
#>
#> [[9]]
#> x y
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 4 d
#> 5 5 e
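If the nesting could be very deep, the recursion can be replaced by an explicit stack. The following is a sketch of that variant in base R, under the same assumption that only data frames should be kept; flatten_dfs is a hypothetical name used for this illustration.

```r
flatten_dfs <- function(x) {
  out <- list()
  stack <- list(x)
  while (length(stack) > 0) {
    # take the next pending item off the front of the stack
    item <- stack[[1]]
    stack <- stack[-1]
    if (is.data.frame(item)) {
      # a data frame is a leaf: append it to the result
      out[[length(out) + 1]] <- item
    } else if (is.list(item)) {
      # unpack one level and put the children at the front to keep order
      stack <- c(item, stack)
    }
    # anything else is silently dropped
  }
  out
}

df <- data.frame(x = 1:5, y = letters[1:5])
l <- list(df, list(df, df), list(df, list(df, df, list(df))), list(df), df)

identical(flatten_dfs(l), rep(list(df), 9))
```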

R+dplyr: conditionally swap the elements of two columns

Consider the dataframe df at the end of the post.
I simply would like to swap the elements of columns x and y whenever x>y.
There may be other columns in the dataframe which I do not want to touch.
In a sense, I would like to sort row wise the columns x and y.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df<-tibble(x=1:10, y=10:1, extra=LETTERS[1:10])
df
#> # A tibble: 10 × 3
#> x y extra
#> <int> <int> <chr>
#> 1 1 10 A
#> 2 2 9 B
#> 3 3 8 C
#> 4 4 7 D
#> 5 5 6 E
#> 6 6 5 F
#> 7 7 4 G
#> 8 8 3 H
#> 9 9 2 I
#> 10 10 1 J
Created on 2021-10-06 by the reprex package (v2.0.1)
Base R solution:
Use which(df$x > df$y) to find the rows you want to change, then use rev() to swap the two columns for those rows:
df[which(df$x > df$y), c("x", "y")] <- rev(df[which(df$x > df$y), c("x", "y")])
df
# x y extra
# <int> <int> <chr>
# 1 1 10 A
# 2 2 9 B
# 3 3 8 C
# 4 4 7 D
# 5 5 6 E
# 6 5 6 F
# 7 4 7 G
# 8 3 8 H
# 9 2 9 I
# 10 1 10 J
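For the two-column case, the swap can also be written without any row indexing, since the result is simply the row-wise minimum in x and the row-wise maximum in y. A minimal base R sketch using pmin() and pmax() (swapped is a name chosen for this example):

```r
df <- data.frame(x = 1:10, y = 10:1, extra = LETTERS[1:10],
                 stringsAsFactors = FALSE)

# transform() evaluates both expressions against the original columns,
# so y = pmax(x, y) still sees the old x
swapped <- transform(df, x = pmin(x, y), y = pmax(x, y))
swapped
```

With dplyr::mutate() the expressions are evaluated one after another, so the same call would compute y from the already-updated x; in that case it is safer to compute both values from the original columns before overwriting either of them.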
Thanks everyone!
I wrote a small function which does what I need and generalizes to the case of multiple variables.
See the reprex:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
set.seed(1234)
set_colnames <- `colnames<-`
df<-tibble(x=1:10, y=10:1, z=rnorm(10), extra=LETTERS[1:10]) %>%
rowwise()
df
#> # A tibble: 10 × 4
#> # Rowwise:
#> x y z extra
#> <int> <int> <dbl> <chr>
#> 1 1 10 -1.21 A
#> 2 2 9 0.277 B
#> 3 3 8 1.08 C
#> 4 4 7 -2.35 D
#> 5 5 6 0.429 E
#> 6 6 5 0.506 F
#> 7 7 4 -0.575 G
#> 8 8 3 -0.547 H
#> 9 9 2 -0.564 I
#> 10 10 1 -0.890 J
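sort_rows <- function(df, col_names, dec = FALSE){
  # columns whose values are sorted within each row
  temp <- df %>%
    select(all_of(col_names))
  # remaining columns, left untouched
  extra_names <- setdiff(colnames(df), col_names)
  temp2 <- df %>%
    select(all_of(extra_names))
  # sort each row, transpose back to one row per observation,
  # restore the column names and re-attach the untouched columns
  res <- t(apply(temp, 1, sort, decreasing = dec)) %>%
    as_tibble %>%
    set_colnames(col_names) %>%
    bind_cols(temp2)
  return(res)
}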
col_names <- c("x", "y", "z")
df_s <- df %>%
sort_rows(col_names, dec = FALSE)
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
#> Using compatibility `.name_repair`.
df_s
#> # A tibble: 10 × 4
#> x y z extra
#> <dbl> <dbl> <dbl> <chr>
#> 1 -1.21 1 10 A
#> 2 0.277 2 9 B
#> 3 1.08 3 8 C
#> 4 -2.35 4 7 D
#> 5 0.429 5 6 E
#> 6 0.506 5 6 F
#> 7 -0.575 4 7 G
#> 8 -0.547 3 8 H
#> 9 -0.564 2 9 I
#> 10 -0.890 1 10 J
Created on 2021-10-06 by the reprex package (v2.0.1)
This looks like sorting to me:
library(tidyverse)
df <- tibble(x=1:10, y=10:1, extra=LETTERS[1:10])
df
#> # A tibble: 10 x 3
#> x y extra
#> <int> <int> <chr>
#> 1 1 10 A
#> 2 2 9 B
#> 3 3 8 C
#> 4 4 7 D
#> 5 5 6 E
#> 6 6 5 F
#> 7 7 4 G
#> 8 8 3 H
#> 9 9 2 I
#> 10 10 1 J
extra_cols <- df %>% colnames() %>% setdiff(c("x", "y"))
extra_cols
#> [1] "extra"
df %>%
mutate(row = row_number()) %>%
pivot_longer(-c(row, extra_cols)) %>%
group_by_at(c("row", extra_cols)) %>%
transmute(
value = value %>% sort(),
name = c("x", "y"),
) %>%
pivot_wider() %>%
ungroup() %>%
select(-row)
#> Note: Using an external vector in selections is ambiguous.
#> ℹ Use `all_of(extra_cols)` instead of `extra_cols` to silence this message.
#> ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
#> This message is displayed once per session.
#> # A tibble: 10 x 3
#> extra x y
#> <chr> <int> <int>
#> 1 A 1 10
#> 2 B 2 9
#> 3 C 3 8
#> 4 D 4 7
#> 5 E 5 6
#> 6 F 5 6
#> 7 G 4 7
#> 8 H 3 8
#> 9 I 2 9
#> 10 J 1 10
Created on 2021-10-06 by the reprex package (v2.0.1)
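Try using apply() over the rows (MARGIN = 1) to sort each row, transpose the result with t(), then use as_tibble() to convert it to a tibble.
Only x and y are passed to apply() here, so the extra column is not part of the result.
Then finally change the column names:
df <- as_tibble(t(apply(df[c("x", "y")], 1, sort)))
names(df) <- c('x', 'y')
df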
# A tibble: 10 x 2
x y
<int> <int>
1 1 10
2 2 9
3 3 8
4 4 7
5 5 6
6 5 6
7 4 7
8 3 8
9 2 9
10 1 10
