R+dplyr: conditionally swap the elements of two columns

Consider the dataframe df at the end of the post.
I simply would like to swap the elements of columns x and y whenever x>y.
There may be other columns in the dataframe which I do not want to touch.
In a sense, I would like to sort row wise the columns x and y.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df<-tibble(x=1:10, y=10:1, extra=LETTERS[1:10])
df
#> # A tibble: 10 × 3
#> # Rowwise:
#> x y extra
#> <int> <int> <chr>
#> 1 1 10 A
#> 2 2 9 B
#> 3 3 8 C
#> 4 4 7 D
#> 5 5 6 E
#> 6 6 5 F
#> 7 7 4 G
#> 8 8 3 H
#> 9 9 2 I
#> 10 10 1 J
Created on 2021-10-06 by the reprex package (v2.0.1)

base solution:
use which(df$x > df$y) to determine row numbers you want to change, then use rev to swap values for these:
df[which(df$x > df$y), c("x", "y")] <- rev(df[which(df$x > df$y), c("x", "y")])
df
# x y extra
# <int> <int> <chr>
# 1 1 10 A
# 2 2 9 B
# 3 3 8 C
# 4 4 7 D
# 5 5 6 E
# 6 5 6 F
# 7 4 7 G
# 8 3 8 H
# 9 2 9 I
# 10 1 10 J
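For just two columns, the same swap can also be written inside a dplyr pipeline. The following is a minimal sketch, not taken from the answers here: it assumes x and y are comparable numeric columns, and lo and hi are hypothetical helper column names that are dropped again at the end.

```r
library(dplyr)

df <- tibble(x = 1:10, y = 10:1, extra = LETTERS[1:10])

# pmin()/pmax() compute the element-wise minimum and maximum, so after
# this step every row has its smaller value in x and its larger in y.
swapped <- df %>%
  mutate(lo = pmin(x, y), hi = pmax(x, y)) %>%  # helper columns (hypothetical names)
  mutate(x = lo, y = hi) %>%
  select(-lo, -hi)                              # other columns stay untouched

swapped
```

Because nothing is reshaped, columns such as extra keep their position and type.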

Thanks everyone!
I wrote a small function which does what I need and generalizes to the case of multiple variables.
See the reprex
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
set.seed(1234)
set_colnames <- `colnames<-`
df<-tibble(x=1:10, y=10:1, z=rnorm(10), extra=LETTERS[1:10]) %>%
rowwise()
df
#> # A tibble: 10 × 4
#> # Rowwise:
#> x y z extra
#> <int> <int> <dbl> <chr>
#> 1 1 10 -1.21 A
#> 2 2 9 0.277 B
#> 3 3 8 1.08 C
#> 4 4 7 -2.35 D
#> 5 5 6 0.429 E
#> 6 6 5 0.506 F
#> 7 7 4 -0.575 G
#> 8 8 3 -0.547 H
#> 9 9 2 -0.564 I
#> 10 10 1 -0.890 J
sort_rows <- function(df, col_names, dec=F){
temp <- df %>%
select(all_of(col_names))
extra_names <- setdiff(colnames(df), col_names)
temp2 <- df %>%
select(all_of(extra_names))
res <- t(apply(temp, 1, sort, decreasing=dec)) %>%
as_tibble %>%
set_colnames(col_names) %>%
bind_cols(temp2)
return(res)
}
col_names <- c("x", "y", "z")
df_s <- df %>%
sort_rows(col_names, dec=F)
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
#> Using compatibility `.name_repair`.
df_s
#> # A tibble: 10 × 4
#> x y z extra
#> <dbl> <dbl> <dbl> <chr>
#> 1 -1.21 1 10 A
#> 2 0.277 2 9 B
#> 3 1.08 3 8 C
#> 4 -2.35 4 7 D
#> 5 0.429 5 6 E
#> 6 0.506 5 6 F
#> 7 -0.575 4 7 G
#> 8 -0.547 3 8 H
#> 9 -0.564 2 9 I
#> 10 -0.890 1 10 J
Created on 2021-10-06 by the reprex package (v2.0.1)

This looks like sorting to me:
library(tidyverse)
df <- tibble(x=1:10, y=10:1, extra=LETTERS[1:10])
df
#> # A tibble: 10 x 3
#> x y extra
#> <int> <int> <chr>
#> 1 1 10 A
#> 2 2 9 B
#> 3 3 8 C
#> 4 4 7 D
#> 5 5 6 E
#> 6 6 5 F
#> 7 7 4 G
#> 8 8 3 H
#> 9 9 2 I
#> 10 10 1 J
extra_cols <- df %>% colnames() %>% setdiff(c("x", "y"))
extra_cols
#> [1] "extra"
df %>%
mutate(row = row_number()) %>%
pivot_longer(-c(row, extra_cols)) %>%
group_by_at(c("row", extra_cols)) %>%
transmute(
value = value %>% sort(),
name = c("x", "y"),
) %>%
pivot_wider() %>%
ungroup() %>%
select(-row)
#> Note: Using an external vector in selections is ambiguous.
#> ℹ Use `all_of(extra_cols)` instead of `extra_cols` to silence this message.
#> ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
#> This message is displayed once per session.
#> # A tibble: 10 x 3
#> extra x y
#> <chr> <int> <int>
#> 1 A 1 10
#> 2 B 2 9
#> 3 C 3 8
#> 4 D 4 7
#> 5 E 5 6
#> 6 F 5 6
#> 7 G 4 7
#> 8 H 3 8
#> 9 I 2 9
#> 10 J 1 10
Created on 2021-10-06 by the reprex package (v2.0.1)

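Try using apply over the rows (MARGIN = 1) to sort each row, transpose the result with t, then use as_tibble to convert it to a tibble.
Then finally change the column names. Note that this assumes df contains only the columns x and y; with the character column extra present, apply would coerce every value to character.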
> df <- as_tibble(t(apply(df, 1, sort)))
> names(df) <- c('x', 'y')
> df
# A tibble: 10 x 2
x y
<int> <int>
1 1 10
2 2 9
3 3 8
4 4 7
5 5 6
6 5 6
7 4 7
8 3 8
9 2 9
10 1 10
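If the other columns need to survive, one option is to apply the row-wise sort only to the columns of interest and assign the result back in place. This is a sketch under the assumption that df is a base data.frame (a tibble may behave differently on matrix assignment) and that cols, a hypothetical helper variable, names the columns to sort:

```r
df <- data.frame(x = 1:10, y = 10:1, extra = LETTERS[1:10])

cols <- c("x", "y")  # columns to sort within each row (hypothetical helper)

# apply() returns one sorted row per *column* of its result, so t() is
# needed to get back a matrix with one row per original row.
df[cols] <- t(apply(df[cols], 1, sort))

df
```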

Related

R + dplyr: Partial Deduplication of Rows in a Tibble

A very common question is how to remove all the duplicated lines in a data frame in R, something which can be done with a variety of tools (I like dplyr+distinct).
However, what if your dataset contains several duplicated lines, but you do not want to remove all of them, but only those for some combination of the variables?
I do not know how to achieve that, so any suggestion is welcome.
Please have a look at the reprex at the end of the post.
Thanks!
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- tibble(x=rep(seq(5), 3), y=rep(LETTERS[1:5],3),
z=c(rep(c("h","j","k","t","u"), 2), LETTERS[1:5])
)
df
#> # A tibble: 15 × 3
#> x y z
#> <int> <chr> <chr>
#> 1 1 A h
#> 2 2 B j
#> 3 3 C k
#> 4 4 D t
#> 5 5 E u
#> 6 1 A h
#> 7 2 B j
#> 8 3 C k
#> 9 4 D t
#> 10 5 E u
#> 11 1 A A
#> 12 2 B B
#> 13 3 C C
#> 14 4 D D
#> 15 5 E E
df_ded <- df |>
distinct()
df_ded
#> # A tibble: 10 × 3
#> x y z
#> <int> <chr> <chr>
#> 1 1 A h
#> 2 2 B j
#> 3 3 C k
#> 4 4 D t
#> 5 5 E u
#> 6 1 A A
#> 7 2 B B
#> 8 3 C C
#> 9 4 D D
#> 10 5 E E
## I want to deduplicate only the rows with x==3 and z=="k"
df_ded_partial <- df |>
distinct(x==3, z=="k") ## but this is not what I mean.
## How to achieve it?
df_ded_partial
#> # A tibble: 3 × 2
#> `x == 3` `z == "k"`
#> <lgl> <lgl>
#> 1 FALSE FALSE
#> 2 TRUE TRUE
#> 3 TRUE FALSE
Created on 2023-02-14 with reprex v2.0.2
We can use group_modify() and check for the condition using the .y argument which is a tibble of the current group. So we can say: if the condition is met return the distinct(.x) group otherwise return the whole group .x.
library(dplyr)
df |>
group_by(x, z) |>
group_modify(~ if(.y$x == 3 && .y$z == "k") distinct(.x) else .x)
#> # A tibble: 14 x 3
#> # Groups: x, z [10]
#> x z y
#> <int> <chr> <chr>
#> 1 1 A A
#> 2 1 h A
#> 3 1 h A
#> 4 2 B B
#> 5 2 j B
#> 6 2 j B
#> 7 3 C C
#> 8 3 k C
#> 9 4 D D
#> 10 4 t D
#> 11 4 t D
#> 12 5 E E
#> 13 5 u E
#> 14 5 u E
Data from OP
df <- tibble(x=rep(seq(5), 3), y=rep(LETTERS[1:5],3),
z=c(rep(c("h","j","k","t","u"), 2), LETTERS[1:5])
)
Created on 2023-02-14 by the reprex package (v2.0.1)
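An alternative that keeps the original row order is to flag duplicated rows first and drop only the flagged rows that also match the condition. This is a sketch rather than part of the answer above; is_dup and df_partial are hypothetical names.

```r
library(dplyr)

df <- tibble(x = rep(seq(5), 3), y = rep(LETTERS[1:5], 3),
             z = c(rep(c("h", "j", "k", "t", "u"), 2), LETTERS[1:5]))

# TRUE for every row that repeats an earlier, identical row
is_dup <- duplicated(df)

# Drop a row only if it is a repeat AND satisfies x == 3 & z == "k"
df_partial <- df %>%
  filter(!(x == 3 & z == "k" & is_dup))

df_partial
```

Unlike the grouped version, this does not reorder rows or columns, and the result is ungrouped.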

Keep Top Factors based on Grouped Weight

Please have a look at the snippet at the end of the post.
I am essentially looking for a cleaner way to obtain the same result.
I have a tibble where the x column is a character vector (I did not translate it into a factor, but this is actually what it is).
Each factor appears multiple times and it always has an associated numerical value (the w column in the tibble).
I would like to keep the top 4 factors according to the sum of their associated w values and change everything else into an "other" factor.
I achieve it below, but I wonder if there is a smarter way to do the same using e.g. forcats.
Any suggestion is appreciated
library(tidyverse)
df <- tibble(x=rep(letters[1:10], 10), w=seq(100))
df
#> # A tibble: 100 × 2
#> x w
#> <chr> <int>
#> 1 a 1
#> 2 b 2
#> 3 c 3
#> 4 d 4
#> 5 e 5
#> 6 f 6
#> 7 g 7
#> 8 h 8
#> 9 i 9
#> 10 j 10
#> # … with 90 more rows
###detect the first 4 factors based on the w column
ff <- df |>
group_by(x) |>
summarise(w_tot=sum(w)) |>
ungroup() |>
arrange(desc(w_tot)) |>
slice(1:4) |>
pull(x)
ff
#> [1] "j" "i" "h" "g"
## recode the data
df_new <- df |>
mutate(w=if_else(x %in% ff, x, "other"))
df_new
#> # A tibble: 100 × 2
#> x w
#> <chr> <chr>
#> 1 a other
#> 2 b other
#> 3 c other
#> 4 d other
#> 5 e other
#> 6 f other
#> 7 g g
#> 8 h h
#> 9 i i
#> 10 j j
#> # … with 90 more rows
Created on 2022-09-16 by the reprex package (v2.0.1)
It appears that I can pass a weight argument (w) to fct_lump_n(), so this works:
library(tidyverse)
library(forcats)
df <- tibble(x=rep(letters[1:10], 10), w=seq(100))
df
#> # A tibble: 100 × 2
#> x w
#> <chr> <int>
#> 1 a 1
#> 2 b 2
#> 3 c 3
#> 4 d 4
#> 5 e 5
#> 6 f 6
#> 7 g 7
#> 8 h 8
#> 9 i 9
#> 10 j 10
#> # … with 90 more rows
###detect the first 4 factors based on the w column
ff <- df |>
group_by(x) |>
summarise(w_tot=sum(w)) |>
ungroup() |>
arrange(desc(w_tot)) |>
slice(1:4) |>
pull(x)
ff
#> [1] "j" "i" "h" "g"
## recode the data
df_new <- df |>
mutate(w=if_else(x %in% ff, x, "other"))
df_new
#> # A tibble: 100 × 2
#> x w
#> <chr> <chr>
#> 1 a other
#> 2 b other
#> 3 c other
#> 4 d other
#> 5 e other
#> 6 f other
#> 7 g g
#> 8 h h
#> 9 i i
#> 10 j j
#> # … with 90 more rows
df_new2 <- df |>
mutate(x2=fct_lump_n(x,4, w))
df_new2
#> # A tibble: 100 × 3
#> x w x2
#> <chr> <int> <fct>
#> 1 a 1 Other
#> 2 b 2 Other
#> 3 c 3 Other
#> 4 d 4 Other
#> 5 e 5 Other
#> 6 f 6 Other
#> 7 g 7 g
#> 8 h 8 h
#> 9 i 9 i
#> 10 j 10 j
#> # … with 90 more rows
Created on 2022-09-16 by the reprex package (v2.0.1)
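The call can be made a little more explicit by naming the arguments. The sketch below assumes a forcats version that provides fct_lump_n() (as used above); other_level = "other" is only there to reproduce the lowercase label from the manual recoding, and df_new3 is a hypothetical name.

```r
library(dplyr)
library(forcats)

df <- tibble(x = rep(letters[1:10], 10), w = seq(100))

# Keep the 4 levels with the largest summed weight w; lump the rest.
df_new3 <- df |>
  mutate(x2 = fct_lump_n(x, n = 4, w = w, other_level = "other"))

# Summed weight per resulting level, as a quick check
df_new3 |>
  count(x2, wt = w)
```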

How can I change the label of row structure into a string?

I'm trying to change the type of the row labels (the ones marked with the red rectangle) to a string (character). Any ideas or suggestions on how I can change it?
Set the rownames() for the data.frame. You might also find the rownames_to_column(), rowid_to_column(), and column_to_rownames() functions from the {tibble} package useful:
dat <- data.frame(x = 1:26)
head(dat)
#> x
#> 1 1
#> 2 2
#> 3 3
#> 4 4
#> 5 5
#> 6 6
rownames(dat) <- letters
head(dat)
#> x
#> a 1
#> b 2
#> c 3
#> d 4
#> e 5
#> f 6
tibble::rownames_to_column(dat, var = "rowname") |>
head()
#> rowname x
#> 1 a 1
#> 2 b 2
#> 3 c 3
#> 4 d 4
#> 5 e 5
#> 6 f 6
tibble::rowid_to_column(dat, var = "rowid") |>
head()
#> rowid x
#> 1 1 1
#> 2 2 2
#> 3 3 3
#> 4 4 4
#> 5 5 5
#> 6 6 6
dat <- data.frame(x = 1:26, rowname = letters)
head(dat)
#> x rowname
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 4 d
#> 5 5 e
#> 6 6 f
tibble::column_to_rownames(dat, var = "rowname") |>
head()
#> x
#> a 1
#> b 2
#> c 3
#> d 4
#> e 5
#> f 6
Created on 2022-07-22 by the reprex package (v2.0.1)
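The same round trip can be done in base R without {tibble}. This is a minimal sketch; the column name label is a hypothetical choice.

```r
dat <- data.frame(x = 1:26)
rownames(dat) <- letters

# rownames() always returns a character vector, so it can be stored
# directly as an ordinary character column.
dat$label <- rownames(dat)

# Setting row names to NULL restores the default 1..n labels.
rownames(dat) <- NULL

str(dat)
```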

collapse a dataframe in R that contains both numeric and character variables

I have the following data.frame:
data <- data.frame("ag" = rep(LETTERS[1:4],6),
"date" = c(sapply(1:3, function(x) rep(x, 8))),
"num_var1"= 1:24,
"num_var2"= 24:1,
"alpha_var1" = LETTERS[1:24],
"alpha_var2" = LETTERS[25:2] )
and I would like to summarize (mean) its rows by ag and date using dplyr. The issue is that some rows include characters: in this case, I would like to get the first entry by group (the example dataset is already sorted).
Since my dataset has several entries, I would like the code to be able to recognize whether a variable is numeric (including integers) or a character. However, the best solution that I have so far is the following one:
data %>%
dplyr::group_by(ag, date) %>%
summarise(across(everything(), mean))
which creates NAs for non-numeric variables. Do you have a better solution?
Is this what you are looking for?
library(dplyr)
data %>%
dplyr::group_by(ag, date) %>%
summarise(across(everything(), ~
if(is.numeric(.x)) mean(.x) else first(.x)))
#> `summarise()` has grouped output by 'ag'. You can override using the `.groups` argument.
#> # A tibble: 12 x 6
#> # Groups: ag [4]
#> ag date num_var1 num_var2 alpha_var1 alpha_var2
#> <chr> <int> <dbl> <dbl> <chr> <chr>
#> 1 A 1 3 22 A Y
#> 2 A 2 11 14 I Q
#> 3 A 3 19 6 Q I
#> 4 B 1 4 21 B X
#> 5 B 2 12 13 J P
#> 6 B 3 20 5 R H
#> 7 C 1 5 20 C W
#> 8 C 2 13 12 K O
#> 9 C 3 21 4 S G
#> 10 D 1 6 19 D V
#> 11 D 2 14 11 L N
#> 12 D 3 22 3 T F
Created on 2022-03-03 by the reprex package (v2.0.1)
Another possible solution:
library(tidyverse)
data %>%
group_by(ag, date) %>%
summarise(across(where(is.numeric), mean),
across(where(is.character), first), .groups = "drop")
#> # A tibble: 12 × 6
#> ag date num_var1 num_var2 alpha_var1 alpha_var2
#> <chr> <int> <dbl> <dbl> <chr> <chr>
#> 1 A 1 3 22 A Y
#> 2 A 2 11 14 I Q
#> 3 A 3 19 6 Q I
#> 4 B 1 4 21 B X
#> 5 B 2 12 13 J P
#> 6 B 3 20 5 R H
#> 7 C 1 5 20 C W
#> 8 C 2 13 12 K O
#> 9 C 3 21 4 S G
#> 10 D 1 6 19 D V
#> 11 D 2 14 11 L N
#> 12 D 3 22 3 T F
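If the real data also contains factor or logical columns, where(is.character) would silently drop them. A sketch of one way around that, assuming "take the first entry" is acceptable for every non-numeric column, is to negate the numeric selector; res is a hypothetical name.

```r
library(dplyr)

data <- data.frame(ag = rep(LETTERS[1:4], 6),
                   date = c(sapply(1:3, function(x) rep(x, 8))),
                   num_var1 = 1:24,
                   num_var2 = 24:1,
                   alpha_var1 = LETTERS[1:24],
                   alpha_var2 = LETTERS[25:2])

res <- data %>%
  group_by(ag, date) %>%
  summarise(across(where(is.numeric), mean),    # numeric and integer columns
            across(!where(is.numeric), first),  # everything else
            .groups = "drop")

res
```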

How to proportionally split data using initial_split r

I would like to proportionally split the data I have. For example, I have 100 rows and I want to randomly sample 1 row every two rows. Using tidymodels rsample I assumed I would do the below.
library(rsample)
library(tibble)
dat <- as_tibble(seq(1:100))
split <- initial_split(dat, prop = 0.5, breaks = 50)
testing <- testing(split)
When checking the data, the split hasn't done what I thought it would. It seems close but not exact. I thought the breaks argument generates bins which are then sampled from, so breaks = 50 would split the 100 rows into 50 bins with two rows per bin. I have also tried strata = value to stratify across the rows, but I cannot get this to work either.
I am using this as an example, but I am also curious how this would work when sampling 1 row in every four, etc.
Have I misunderstood the breaks argument?
You are running up against an argument that protects users from creating stratified splits whose strata are too small; it's called pool:
library(rsample)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
dat <- tibble(value = seq(1:100), strat = as.factor(rep(1:50, each = 2)))
dat
#> # A tibble: 100 × 2
#> value strat
#> <int> <fct>
#> 1 1 1
#> 2 2 1
#> 3 3 2
#> 4 4 2
#> 5 5 3
#> 6 6 3
#> 7 7 4
#> 8 8 4
#> 9 9 5
#> 10 10 5
#> # … with 90 more rows
split <- initial_split(dat, prop = 0.5, strata = strat, pool = 0.0)
#> Warning: Stratifying groups that make up 0% of the data may be statistically risky.
#> • Consider increasing `pool` to at least 0.1
split
#> <Analysis/Assess/Total>
#> <50/50/100>
training(split) %>% arrange(strat)
#> # A tibble: 50 × 2
#> value strat
#> <int> <fct>
#> 1 1 1
#> 2 4 2
#> 3 5 3
#> 4 8 4
#> 5 10 5
#> 6 12 6
#> 7 13 7
#> 8 16 8
#> 9 17 9
#> 10 20 10
#> # … with 40 more rows
testing(split) %>% arrange(strat)
#> # A tibble: 50 × 2
#> value strat
#> <int> <fct>
#> 1 2 1
#> 2 3 2
#> 3 6 3
#> 4 7 4
#> 5 9 5
#> 6 11 6
#> 7 14 7
#> 8 15 8
#> 9 18 9
#> 10 19 10
#> # … with 40 more rows
Created on 2022-02-22 by the reprex package (v2.0.1)
We really don't recommend turning pool down to zero like this, but you can do it here to see how the strata and prop arguments work.
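For the "1 row in every four" case, the sampling can also be done directly with dplyr instead of rsample. This is only a sketch of that alternative, not an rsample feature: bin is a hypothetical helper column built from the row position, and the seed is arbitrary.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only for reproducibility
dat <- tibble(value = 1:100)

sampled <- dat %>%
  mutate(bin = (row_number() - 1) %/% 4) %>%  # rows 1-4 -> bin 0, rows 5-8 -> bin 1, ...
  group_by(bin) %>%
  slice_sample(n = 1) %>%                     # one random row per bin
  ungroup()

sampled
```

Changing the divisor from 4 to 2 gives one row out of every two; the rows not selected can be recovered with anti_join(dat, sampled, by = "value").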
