Finding rowwise minimum and column index in a tibble - r

I have the following tibble:
> df <- tibble(
ID = LETTERS[1:4],
a = c(1,5,9,8),
b = c(5,9,8,2),
c = c(5,4,5,5)
)
> df
# A tibble: 4 x 4
ID a b c
<chr> <dbl> <dbl> <dbl>
1 A 1 5 5
2 B 5 9 4
3 C 9 8 5
4 D 8 2 5
>
What I want is the rowwise minimum of columns a:c and also the column index of this minimum.
The output table should look like this:
# A tibble: 4 x 6
ID a b c Min Col_Index
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 1 5 5 1 1
2 B 5 9 4 4 3
3 C 9 8 5 5 3
4 D 8 2 5 2 2
I don't want to use rowwise()!
Thank you!

You could use pmin with do.call to get rowwise minimum and negate the values to use with max.col to get the column index of minimum.
library(dplyr)
library(purrr)
df %>%
mutate(Min = do.call(pmin, select(., a:c)),
Col_Index = max.col(-select(., a:c)))
# ID a b c Min Col_Index
# <chr> <dbl> <dbl> <dbl> <dbl> <int>
#1 A 1 5 5 1 1
#2 B 5 9 4 4 3
#3 C 9 8 5 5 3
#4 D 8 2 5 2 2
Using purrr's pmap_dbl :
df %>%
mutate(Min = pmap_dbl(select(., a:c), ~min(c(...))),
Col_Index = pmap_dbl(select(., a:c), ~which.min(c(...))))
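One caveat with max.col: its default ties.method is "random", so when a row has its minimum in more than one column the reported index can differ between runs, whereas which.min always returns the first one. A minimal sketch, assuming you want the first matching column; row E with a tie is made up for illustration and is not part of the question's data:

```r
library(dplyr)

# Question's data plus a hypothetical row "E" where a and b tie for the minimum
df_tie <- tibble(
  ID = LETTERS[1:5],
  a = c(1, 5, 9, 8, 3),
  b = c(5, 9, 8, 2, 3),
  c = c(5, 4, 5, 5, 7)
)

res <- df_tie %>%
  mutate(Min = do.call(pmin, select(., a:c)),
         # ties.method = "first" makes the result deterministic,
         # matching what which.min() would return row by row
         Col_Index = max.col(-select(., a:c), ties.method = "first"))

res
```

With ties.method = "first" the tied row reports column 1, the same answer the pmap_dbl/which.min version gives.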

One option could be:
df %>%
rowwise() %>%
mutate(min = min(c_across(a:c)),
min_index = which.min(c_across(a:c)))
ID a b c min min_index
<chr> <dbl> <dbl> <dbl> <dbl> <int>
1 A 1 5 5 1 1
2 B 5 9 4 4 3
3 C 9 8 5 5 3
4 D 8 2 5 2 2

Base R solution:
setNames(cbind(df, t(apply(df[, vapply(df, is.numeric, logical(1))], 1, function(row) {
cbind(min(row), which.min(row))}))), c(names(df), "min", "col_index"))

Related

Subset row if column value for any of multiple columns equals value in a list

I have a data frame with ten columns, but only five columns of concern: A, B, C, D, E. I also have a list of values. What's the best way to subset the rows where the value in column A, B, C, D, or E is included in the list of values?
If I were only concerned with a single column, I know I can use left_join(list_of_values, df$A) but I'm not sure how to do something similar with multiple columns.
The key here is if_any.
library(tidyverse)
set.seed(26)
sample_df <- tibble(col = rep(LETTERS[1:8], each = 5),
val = sample(1:10, 40, replace = TRUE),
ID = rep(1:5, 8)) |>
pivot_wider(names_from = col, values_from = val)
sample_df
#> # A tibble: 5 x 9
#> ID A B C D E F G H
#> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 1 8 4 10 7 2 7 4 3
#> 2 2 3 2 3 3 4 10 2 3
#> 3 3 9 6 6 8 2 10 10 3
#> 4 4 7 6 8 9 3 5 8 3
#> 5 5 6 3 4 1 9 7 9 1
vals <- c(1, 7)
#solution
sample_df |>
filter(if_any(A:E, ~. %in% vals))
#> # A tibble: 3 x 9
#> ID A B C D E F G H
#> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 1 8 4 10 7 2 7 4 3
#> 2 4 7 6 8 9 3 5 8 3
#> 3 5 6 3 4 1 9 7 9 1
Or use any and apply in base R:
#base solution
indx <- apply(sample_df[,which(colnames(sample_df) %in% LETTERS[1:5])], 1, \(x) any(x %in% vals))
sample_df[indx,]
#> # A tibble: 3 x 9
#> ID A B C D E F G H
#> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 1 8 4 10 7 2 7 4 3
#> 2 4 7 6 8 9 3 5 8 3
#> 3 5 6 3 4 1 9 7 9 1
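If you would rather avoid iterating over rows with apply, the same test can be done column by column. This is only a sketch in base R: the data frame below copies columns ID and A to F of the printed sample_df by hand (G and H are left out), and the names cols, hits and kept are my own.

```r
# Hand-copied subset of the sample data shown above (columns ID, A..F)
sample_df <- data.frame(
  ID = 1:5,
  A = c(8, 3, 9, 7, 6),
  B = c(4, 2, 6, 6, 3),
  C = c(10, 3, 6, 8, 4),
  D = c(7, 3, 8, 9, 1),
  E = c(2, 4, 2, 3, 9),
  F = c(7, 10, 10, 5, 7)
)
vals <- c(1, 7)
cols <- c("A", "B", "C", "D", "E")

# One logical vector per column, then combine them with elementwise OR
hits <- Reduce(`|`, lapply(sample_df[cols], function(col) col %in% vals))
kept <- sample_df[hits, ]
kept
```

Column F is deliberately not in cols, so a 7 there does not count as a match.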

Multiply every new rows created by `separate_rows`

I am using the separate_rows function from tidyr.
Essentially, I would like to change the value of the data that is copied. In the example below, it would read: "every time a new row is created, multiply z by 0.5".
I already added an index to the original df, so it could also be: "every time the index N is the same as in the previous row [-1], multiply z by 0.5".
df <- tibble(
x = 1:3,
y = c("a", "b,c,d", "e,f"),
z = 1:3
)
# A tibble: 3 x 3
x y z
<int> <chr> <int>
1 1 a 1
2 2 b,c,d 2
3 3 e,f 3
what we get:
> separate_rows(df, y)
# A tibble: 6 x 3
x y z
<int> <chr> <int>
1 1 a 1
2 2 b 2
3 2 c 2
4 2 d 2
5 3 e 3
6 3 f 3
what I would need (the z values of rows that were split are multiplied by 0.5):
# A tibble: 6 x 3
x y z
<int> <chr> <dbl>
1 1 a 1
2 2 b 1
3 2 c 1
4 2 d 1
5 3 e 1.5
6 3 f 1.5
You can group by z and multiply by 0.5 when a group has more than one row (n() > 1).
df %>%
separate_rows(y) %>%
group_by(z) %>%
mutate(z = ifelse(n() > 1, z*0.5, z))
x y z
<int> <chr> <dbl>
1 1 a 1
2 2 b 1
3 2 c 1
4 2 d 1
5 3 e 1.5
6 3 f 1.5
An option is also to multiply 'z' by 0.5, get the pmax with 1 and then use separate_rows
library(dplyr)
library(tidyr)
df %>%
mutate(z = pmax(1, z * 0.5)) %>%
separate_rows(y)
-output
# A tibble: 6 × 3
x y z
<int> <chr> <dbl>
1 1 a 1
2 2 b 1
3 2 c 1
4 2 d 1
5 3 e 1.5
6 3 f 1.5
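Note that pmax(1, z * 0.5) gives the desired result for these particular z values, and grouping by z relies on z being unique per original row. A sketch that does not depend on either, assuming the pieces in y are always separated by a comma: count the pieces before splitting and halve z only where there is more than one. The helper column n_parts is my own name.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  x = 1:3,
  y = c("a", "b,c,d", "e,f"),
  z = 1:3
)

res <- df %>%
  # number of comma-separated pieces in each y
  mutate(n_parts = lengths(strsplit(y, ",", fixed = TRUE)),
         # halve z only for rows that will be split into several rows
         z = if_else(n_parts > 1, z * 0.5, as.double(z))) %>%
  select(-n_parts) %>%
  separate_rows(y, sep = ",")

res
```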

Is there a way to get subdataframes with purrr in magrittr pipes workflow without using data.frame name?

That is, I was interested in doing the same as in the example, but with purrr functions.
tibble(a = 1:5, b = a * 2, c = 1) %>%
{lapply(X = names(.), FUN = function(.x) select(., 1:.x))}
[[1]]
# A tibble: 5 x 1
a
<int>
1 1
2 2
3 3
4 4
5 5
[[2]]
# A tibble: 5 x 2
a b
<int> <dbl>
1 1 2
2 2 4
3 3 6
4 4 8
5 5 10
[[3]]
# A tibble: 5 x 3
a b c
<int> <dbl> <dbl>
1 1 2 1
2 2 4 1
3 3 6 1
4 4 8 1
5 5 10 1
I could only do it if I named foo <- tibble(a, b = a * 2, c = 1) and called select(foo, ...) inside map, but I wanted to avoid that, since I want to mutate the data frame inside the pipe workflow.
Thank you!
You can use map in the following way :
library(dplyr)
library(purrr)
tibble(a = 1:5, b = a * 2, c = 1) %>%
{map(names(.), function(.x) select(., 1:.x))}
Depending on your actual use case, you can also use imap, which passes the column values (.x) along with the column name (.y).
tibble(a = 1:5, b = a * 2, c = 1) %>%
imap(function(.x, .y) select(., 1:.y))
#$a
# A tibble: 5 x 1
# a
# <int>
#1 1
#2 2
#3 3
#4 4
#5 5
#$b
# A tibble: 5 x 2
# a b
# <int> <dbl>
#1 1 2
#2 2 4
#3 3 6
#4 4 8
#5 5 10
#$c
# A tibble: 5 x 3
# a b c
# <int> <dbl> <dbl>
#1 1 2 1
#2 2 4 1
#3 3 6 1
#4 4 8 1
#5 5 10 1
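Both versions above use function(.x) rather than a ~ lambda for a reason: inside a purrr formula lambda, . refers to the lambda's own argument, so it would no longer point at the piped tibble. A small variation, offered as a sketch, iterates over column positions instead of names and wraps them in all_of(); the argument name i is arbitrary.

```r
library(dplyr)
library(purrr)

res <- tibble(a = 1:5, b = a * 2, c = 1) %>%
  # inside the braces `.` is the piped tibble; the inner function
  # only introduces `i`, so `.` still refers to the tibble there
  {map(seq_along(.), function(i) select(., all_of(seq_len(i))))}

# number of columns in each sub data frame
map_int(res, ncol)
```

The result is an unnamed list of three tibbles with 1, 2 and 3 columns, like the lapply output in the question.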

Overwrite left_join dplyr to update data

My question is similar to this one however I have additional columns in the LHS that should be kept https://stackoverflow.com/a/35642948/9285732
y is a subset of x with updated values for val1. In x I want to overwrite the relevant values but keep the rest.
Sample data:
library(tidyverse)
x <- tibble(name = c("hans", "dieter", "bohlen", "hans", "dieter", "alf"),
location = c(1,1,1,2,2,3),
val1 = 1:6, val2 = 1:6, val3 = 1:6)
y <- tibble(name = c("hans", "dieter", "hans"),
location = c(2,2,1),
val1 = 10)
> x
# A tibble: 6 x 5
name location val1 val2 val3
<chr> <dbl> <int> <int> <int>
1 hans 1 1 1 1
2 dieter 1 2 2 2
3 bohlen 1 3 3 3
4 hans 2 4 4 4
5 dieter 2 5 5 5
6 alf 3 6 6 6
> y
# A tibble: 3 x 3
name location val1
<chr> <dbl> <dbl>
1 hans 2 10
2 dieter 2 10
3 hans 1 10
> # desired output
> out
# A tibble: 6 x 5
name location val1 val2 val3
<chr> <dbl> <dbl> <int> <int>
1 hans 1 10 1 1
2 dieter 1 2 2 2
3 bohlen 1 3 3 3
4 hans 2 10 4 4
5 dieter 2 10 5 5
6 alf 3 6 6 6
I wrote a function that is doing what I want, however it's quite cumbersome. I wonder if there's a more elegant way or even a dplyr function that I'm unaware of.
overwrite_join <- function(x, y, by = NULL){
bycols <- which(colnames(x) %in% by)
commoncols <- which(colnames(x) %in% colnames(y))
extracols <- which(!(colnames(x) %in% colnames(y)))
x1 <- anti_join(x, y, by = by) %>%
bind_rows(y) %>%
select(commoncols) %>%
left_join(x %>% select(bycols, extracols), by = by)
out <- x %>% select(by) %>%
left_join(x1, by = by)
return(out)
}
overwrite_join(x, y, by = c("name", "location"))
You could do something along the lines of
> x %>%
left_join(y = y, by = c("name", "location")) %>%
within(., val1.x <- ifelse(!is.na(val1.y), val1.y, val1.x)) %>%
select(-val1.y)
# # A tibble: 6 x 5
# name location val1.x val2 val3
# <chr> <dbl> <dbl> <int> <int>
# 1 hans 1 10 1 1
# 2 dieter 1 2 2 2
# 3 bohlen 1 3 3 3
# 4 hans 2 10 4 4
# 5 dieter 2 10 5 5
# 6 alf 3 6 6 6
and then rename val1.x.
My package safejoin might help. It is only available on GitHub so far, but it has a feature designed just for that.
The conflict argument below must be fed a function or lambda to deal with conflicting columns when joining, here we want in priority a value from the y data frame so we can use dplyr::coalesce() there. Note that we must first coerce y$val1 as in your example it's double while x$val1 is integer. Your real case might not need this step.
# remotes::install_github("moodymudskipper/safejoin")
library(safejoin)
library(dplyr)
y$val1 <- as.integer(y$val1)
safe_left_join(x, y, by = c("name", "location"), conflict = ~coalesce(.y, .x))
#> # A tibble: 6 x 5
#> name location val1 val2 val3
#> <chr> <dbl> <int> <int> <int>
#> 1 hans 1 10 1 1
#> 2 dieter 1 2 2 2
#> 3 bohlen 1 3 3 3
#> 4 hans 2 10 4 4
#> 5 dieter 2 10 5 5
#> 6 alf 3 6 6 6
Edit : inspired by your own solution here's a 100% dplyr option that you might like better, just like your option though it's not a proper join!
bind_rows(y, x) %>%
group_by(name, location) %>%
summarize_all(~na.omit(.x)[[1]]) %>%
ungroup()
#> # A tibble: 6 x 5
#> name location val1 val2 val3
#> <chr> <dbl> <dbl> <int> <int>
#> 1 alf 3 6 6 6
#> 2 bohlen 1 3 3 3
#> 3 dieter 1 2 2 2
#> 4 dieter 2 10 5 5
#> 5 hans 1 10 1 1
#> 6 hans 2 10 4 4
Try dplyr::coalesce
x %>%
left_join(y, by = c("name", "location")) %>%
mutate(val1 = coalesce(val1.y, val1.x)) %>%
select(-val1.x, -val1.y)
# A tibble: 6 x 5
name location val2 val3 val1
<chr> <dbl> <int> <int> <int>
1 hans 1 1 1 10
2 dieter 1 2 2 2
3 bohlen 1 3 3 3
4 hans 2 4 4 10
5 dieter 2 5 5 10
6 alf 3 6 6 6
This is the idiom I now use. It does not preserve the row or column order in x, if that is important.
I like it because I can evaluate the values to just before the bind_rows(), do a visual inspection, and if I like it, put the fixed rows back onto the base dataframe.
library(dplyr)
x <- tibble(name = c("hans", "dieter", "bohlen", "hans", "dieter", "alf"),
location = c(1,1,1,2,2,3),
val1 = 1:6, val2 = 1:6, val3 = 1:6)
y <- tibble(name = c("hans", "dieter", "hans"),
location = c(2,2,1),
val1 = 10)
keys <- c("name", "location")
out <- x %>%
semi_join(y, keys) %>%
select(-matches(setdiff(names(y), keys))) %>%
left_join(y, keys) %>%
bind_rows(x %>% anti_join(y, keys))
out %>%
print()
#> # A tibble: 6 x 5
#> name location val2 val3 val1
#> <chr> <dbl> <int> <int> <dbl>
#> 1 hans 1 1 1 10
#> 2 hans 2 4 4 10
#> 3 dieter 2 5 5 10
#> 4 dieter 1 2 2 2
#> 5 bohlen 1 3 3 3
#> 6 alf 3 6 6 6
Created on 2019-12-12 by the reprex package (v0.3.0)
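For readers on dplyr 1.0.0 or later, rows_update() may cover this case directly: it overwrites the columns of x that are present in y for the rows whose keys match, and leaves every other column and the row order alone. This is a sketch under two assumptions that hold in the sample data: the key combinations in y are unique, and each of them exists in x. To sidestep any type-casting differences between dplyr versions, val1 in y is created as an integer here (10L), which differs from the question's double.

```r
library(dplyr)

x <- tibble(name = c("hans", "dieter", "bohlen", "hans", "dieter", "alf"),
            location = c(1, 1, 1, 2, 2, 3),
            val1 = 1:6, val2 = 1:6, val3 = 1:6)
# val1 is an integer here so that it matches the type of x$val1
y <- tibble(name = c("hans", "dieter", "hans"),
            location = c(2, 2, 1),
            val1 = 10L)

# Overwrite val1 in x for the (name, location) pairs found in y
out <- rows_update(x, y, by = c("name", "location"))
out
```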

bootstrap by group in tibble

Suppose I have a tibble tbl_
tbl_ <- tibble(id = c(1,1,2,2,3,3), dta = 1:6)
tbl_
# A tibble: 6 x 2
id dta
<dbl> <int>
1 1 1
2 1 2
3 2 3
4 2 4
5 3 5
6 3 6
There are 3 id groups. I want to resample entire id groups 3 times with replacement. For example the resulting tibble can be:
id dta
<dbl> <int>
1 1 1
2 1 2
3 1 1
4 1 2
5 3 5
6 3 6
but not
id dta
<dbl> <int>
1 1 1
2 1 2
3 1 1
4 2 4
5 3 5
6 3 6
or
id dta
<dbl> <int>
1 1 1
2 1 1
3 2 3
4 2 4
5 3 5
6 3 6
Here is one option with sample_n and distinct
library(tidyverse)
distinct(tbl_, id) %>%
sample_n(nrow(.), replace = TRUE) %>%
pull(id) %>%
map_df( ~ tbl_ %>%
filter(id == .x)) %>%
arrange(id)
# A tibble: 6 x 2
# id dta
# <dbl> <int>
#1 1.00 1
#2 1.00 2
#3 1.00 1
#4 1.00 2
#5 3.00 5
#6 3.00 6
Another option is to get the minimum row number for each id. Those row numbers are then sampled with replace = TRUE to pick whole groups.
library(dplyr)
tbl_ %>% mutate(rn = row_number()) %>%
group_by(id) %>%
summarise(minrow = min(rn)) ->min_row
indx <- rep(sample(min_row$minrow, nrow(min_row), replace = TRUE), each = 2) +
rep(c(0,1), 3)
tbl_[indx,]
# # A tibble: 6 x 2
# id dta
# <dbl> <int>
# 1 1.00 1
# 2 1.00 2
# 3 3.00 5
# 4 3.00 6
# 5 2.00 3
# 6 2.00 4
Note: This answer assumes that each id has exactly 2 rows, although it works for any number of ids. The hard-coded each = 2 and c(0, 1) need to be changed in order to handle more than 2 rows per id.
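A sketch that avoids the hard-coded group size, assuming only that whole id groups should be drawn with replacement: split the row numbers by id, sample the group names, and index the tibble with the rows of the drawn groups. The names rows_by_id, drawn and boot are my own.

```r
library(tibble)

set.seed(42)  # only to make the draw reproducible

tbl_ <- tibble(id = c(1, 1, 2, 2, 3, 3), dta = 1:6)

# Row numbers belonging to each id, e.g. `1` -> 1 2, `2` -> 3 4, ...
rows_by_id <- split(seq_len(nrow(tbl_)), tbl_$id)

# Draw as many groups as there are ids, with replacement
drawn <- sample(names(rows_by_id), length(rows_by_id), replace = TRUE)

# Stack the rows of every drawn group, keeping each group intact
boot <- tbl_[unlist(rows_by_id[drawn], use.names = FALSE), ]
boot
```

Because complete row sets are copied per draw, groups of different sizes work without changing the code; the number of rows in boot then depends on which groups were drawn.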

Resources