dplyr `slice_max` interpolation not working

Given a data.frame:
library(tidyverse)
set.seed(0)
df <- tibble(A = 1:10, B = rnorm(10), C = rbinom(10,2,0.6))
var <- "B"
I'd like to filter the data frame to the rows with the highest values of the variable named in var. Logically, I'd do either:
df %>%
slice_max({{ var }}, n = 5)
#> # A tibble: 1 × 3
#> A B C
#> <int> <dbl> <int>
#> 1 1 1.26 1
df %>%
slice_max(!! var, n = 5)
#> # A tibble: 1 × 3
#> A B C
#> <int> <dbl> <int>
#> 1 1 1.26 1
But neither interpolation is working... what am I missing here?
Expected output would be the same as:
df %>%
slice_max(B, n = 5)
#> # A tibble: 5 × 3
#> A B C
#> <int> <dbl> <int>
#> 1 10 2.40 0
#> 2 3 1.33 2
#> 3 4 1.27 1
#> 4 1 1.26 1
#> 5 5 0.415 2

I think you need to use the newer .data pronoun, which looks a column up by its name as a string:
df %>%
slice_max(.data[[var]] , n = 5)
#> # A tibble: 5 × 3
#> A B C
#> <int> <dbl> <int>
#> 1 10 2.40 0
#> 2 3 1.33 2
#> 3 4 1.27 1
#> 4 1 1.26 1
#> 5 5 0.415 2
I am puzzled as to why your approach returns only the first row, though!

We may convert the string to a symbol with rlang::sym() and unquote it with !!:
library(dplyr)
df %>%
slice_max(!! rlang::sym(var), n = 5)
-output
# A tibble: 5 × 3
A B C
<int> <dbl> <int>
1 10 2.40 0
2 3 1.33 2
3 4 1.27 1
4 1 1.26 1
5 5 0.415 2
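To see when each form applies, here is a minimal sketch. The helper names top_by_name() and top_by_col() are hypothetical and not part of dplyr. The idea is that {{ }} is meant for a function argument that receives a bare column name, while a column name stored as a string needs .data[[...]] (or sym() with !!). Applied directly to a character variable, {{ var }} and !! var just hand slice_max() the constant "B", which is not a column reference.

```r
library(dplyr)

set.seed(0)
df <- tibble(A = 1:10, B = rnorm(10), C = rbinom(10, 2, 0.6))

# Hypothetical helper: the column arrives as a string, so index .data by name
top_by_name <- function(data, col_name, n = 5) {
  data %>% slice_max(.data[[col_name]], n = n)
}

# Hypothetical helper: the column arrives as a bare name, so embrace it
top_by_col <- function(data, col, n = 5) {
  data %>% slice_max({{ col }}, n = n)
}

var <- "B"
by_name <- top_by_name(df, var)
by_col  <- top_by_col(df, B)
```

Both calls should return the same five rows as slice_max(B, n = 5).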

Related

tidyr::expand_grid() not behaving as expected; what am I missing?

Here's my reprex:
library(tidyverse)
# make some data
a = tibble(b=1:2,c=2:1)
print(a)
#> # A tibble: 2 x 2
#> b c
#> <int> <int>
#> 1 1 2
#> 2 2 1
expand_grid(a) # doesn't produce the expected output
#> # A tibble: 2 x 2
#> b c
#> <int> <int>
#> 1 1 2
#> 2 2 1
# expected output achieved by:
(
a
%>% as.list()
%>% map(unique)
%>% cross_df()
)
#> # A tibble: 4 x 2
#> b c
#> <int> <int>
#> 1 1 2
#> 2 2 2
#> 3 1 1
#> 4 2 1
Created on 2021-08-17 by the reprex package (v2.0.0)
expand_grid behaves differently from base R's expand.grid here. The behavior is similar, though, if we splice the columns in as separate arguments with do.call (or the purrr equivalents: invoke, which is retired, or exec):
library(purrr)
library(tidyr)
invoke(expand_grid, a)
exec(expand_grid, !!! a) # from @Mike Lawrence's comment
-output
# A tibble: 4 x 2
b c
<int> <int>
1 1 2
2 1 1
3 2 2
4 2 1
That is, expand.grid can work on a list directly:
expand.grid(a)
expand.grid(unclass(a))
whereas expand_grid treats the whole list as a single variable:
expand_grid(unclass(a))
# A tibble: 2 x 1
`unclass(a)`
<named list>
1 <int [2]>
2 <int [2]>
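The difference can be sketched as follows. This is an illustration under the assumption that expand_grid() treats each argument as one variable and keeps the rows of a data frame argument together; the extra column k is made up for the example.

```r
library(tibble)
library(tidyr)

a <- tibble(b = 1:2, c = 2:1)

# One argument: the tibble is a single "variable", so its rows stay paired
as_unit <- expand_grid(a)

# Spliced: each column becomes its own argument, so all combinations appear
spliced <- do.call(expand_grid, as.list(a))

# A data frame crossed with another vector: every row of `a` meets every k
with_extra <- expand_grid(a, k = 1:3)
```

Here as_unit has 2 rows, spliced has 4, and with_extra has 6.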

Join current and previous dataframes in nested tibble using lag() and mutate() to produce a new list-column

I am trying to determine the difference between the set of ids in subsequent pairs of dataframes. The dataframes are derived from an original dataframe split by a grouping variable representing the time period. The results should show the rows of the new ids that occur in the current time period compared to the previous one.
I can accomplish this with a list of dataframes:
library(tidyverse)
set.seed(999)
examp <- tibble(
id = c(replicate(4, sample.int(20, 9))),
year = rep(1:4, each = 9),
val = runif(36)
)
examp %>%
split(.$year) %>%
# note my default, I compare the first year to itself
map2(lag(., default = .[1]), anti_join, by = "id")
$`1`
# A tibble: 0 x 3
# ... with 3 variables: id <int>, year <int>, val <dbl>
$`2`
# A tibble: 3 x 3
id year val
<int> <int> <dbl>
1 5 2 0.450
2 11 2 0.943
3 2 2 0.571
$`3`
# A tibble: 6 x 3
id year val
<int> <int> <dbl>
1 19 3 0.870
2 12 3 0.403
3 9 3 0.331
4 20 3 0.315
5 16 3 0.455
6 17 3 0.699
$`4`
# A tibble: 5 x 3
id year val
<int> <int> <dbl>
1 4 4 0.190
2 11 4 0.0804
3 2 4 0.247
4 1 4 0.619
5 18 4 0.434
But I could not get the same to work using mutate in a nested dataframe:
examp %>%
nest_by(year) %>%
mutate(new = anti_join(data, lag(data), by = "id"))
# A tibble: 4 x 3
# Rowwise: year
year data new$id $val
<int> <list<tibble[,2]>> <int> <dbl>
1 1 [9 x 2] 3 0.0601
2 2 [9 x 2] 1 0.495
3 3 [9 x 2] 17 0.699
4 4 [9 x 2] 18 0.434
Here I could not figure out how to specify the default and the output is unexpected. I expected "new" to be a list-column of dataframes corresponding with those above, which I could then unnest.
I am interested in learning more about working with nested dataframes and any help understanding how to get this to work would be much appreciated. Additionally, if there is another (simple) solution to this general problem, I would be happy to learn about it.
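The result of anti_join() should be wrapped in list() so that mutate() stores it as a list-column. The lagged column is computed after ungroup(), where lag() can see all the rows, and the join is then done rowwise: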
library(dplyr)
out <- examp %>%
nest_by(year) %>%
ungroup() %>%
mutate(newdat = lag(data, default = data[1])) %>%
rowwise() %>%
mutate(new = list(anti_join(data, newdat, by = 'id')))
-output
out$new
[[1]]
# A tibble: 0 x 2
# … with 2 variables: id <int>, val <dbl>
[[2]]
# A tibble: 3 x 2
id val
<int> <dbl>
1 5 0.450
2 11 0.943
3 2 0.571
[[3]]
# A tibble: 6 x 2
id val
<int> <dbl>
1 19 0.870
2 12 0.403
3 9 0.331
4 20 0.315
5 16 0.455
6 17 0.699
[[4]]
# A tibble: 5 x 2
id val
<int> <dbl>
1 4 0.190
2 11 0.0804
3 2 0.247
4 1 0.619
5 18 0.434
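A possible alternative that avoids rowwise() altogether is sketched below. It assumes a plain list-column built with tidyr::nest() and pairs each year's data with the previous year's using purrr::map2(); the column name prev is made up for the example.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(999)
examp <- tibble(
  id = c(replicate(4, sample.int(20, 9))),
  year = rep(1:4, each = 9),
  val = runif(36)
)

out <- examp %>%
  nest(data = -year) %>%                       # one row per year, plain list-column
  mutate(
    prev = lag(data, default = data[1]),       # first year is compared to itself
    new  = map2(data, prev, anti_join, by = "id")
  )
```

out$new is then a list-column of data frames that can be passed to unnest().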

modify_at to remove NA values in each element in a list

I have a big list of small datasets like this:
>> my_list
[[1]]
# A tibble: 6 x 2
Year FIPS
<dbl> <chr>
1 2015 12001
2 2015 51013
3 2015 12081
4 2015 12115
5 2015 12127
6 2015 42003
[[2]]
# A tibble: 9 x 2
Year FIPS
<dbl> <chr>
1 2017 04013
2 2017 10003
3 2017 NA
4 2017 25005
5 2017 25009
6 2017 25013
7 2017 25017
8 2017 25021
9 2017 25027
...
I want to remove the NAs from each tibble using modify_at because it looks like a clean way to do it. This is my attempt:
my_list %>% modify_at(c("FIPS"), drop_na)
I tried also with na.omit, but I get the same error in both cases:
Error: character indexing requires a named object
Can anyone help me here, please? What am I doing wrong?
Creating some data.
library(tidyverse)
mylist <-
list(tibble(a = c(1, 2, NA),
b = c(2, 2, 2)),
tibble(c = rep(1, 5),
d = sample(c(NA, 2), 5, replace = TRUE)))
The .at argument in purrr::modify_at() specifies the list element to modify, not the column within the dataframe nested in the list. purrr::modify() works for your purposes.
modify(mylist, drop_na)
#> [[1]]
#> # A tibble: 2 x 2
#> a b
#> <dbl> <dbl>
#> 1 1 2
#> 2 2 2
#>
#> [[2]]
#> # A tibble: 4 x 2
#> c d
#> <dbl> <dbl>
#> 1 1 2
#> 2 1 2
#> 3 1 2
#> 4 1 2
purrr::map() also works. Since your input and output are both list objects, map() is sufficient here, while modify() would be preferred if your input is of another class than a regular list and you want to conserve that class attribute for the output.
map(mylist, drop_na)
#> [[1]]
#> # A tibble: 2 x 2
#> a b
#> <dbl> <dbl>
#> 1 1 2
#> 2 2 2
#>
#> [[2]]
#> # A tibble: 4 x 2
#> c d
#> <dbl> <dbl>
#> 1 1 2
#> 2 1 2
#> 3 1 2
#> 4 1 2
base R
lapply(mylist, na.omit)
#> [[1]]
#> # A tibble: 2 x 2
#> a b
#> <dbl> <dbl>
#> 1 1 2
#> 2 2 2
#>
#> [[2]]
#> # A tibble: 4 x 2
#> c d
#> <dbl> <dbl>
#> 1 1 2
#> 2 1 2
#> 3 1 2
#> 4 1 2
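If only the NAs in one column should be dropped, the column can be passed to drop_na() inside the mapped function. The sketch below uses made-up data shaped like the list in the question, and also shows what .at does select: elements of the list, by position or name.

```r
library(tibble)
library(tidyr)
library(purrr)

# Made-up data resembling the question's list of tibbles
my_list <- list(
  tibble(Year = 2015, FIPS = c("12001", "51013")),
  tibble(Year = 2017, FIPS = c("04013", NA, "25005"))
)

# Drop rows with NA in FIPS from every tibble
cleaned <- map(my_list, function(d) drop_na(d, FIPS))

# .at picks list elements: here only the second tibble is modified
second_only <- modify_at(my_list, 2, drop_na)
```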

Finding rowwise minimum and column index in a tibble

I have the following tibble:
> df <- tibble(
ID = LETTERS[1:4],
a = c(1,5,9,8),
b = c(5,9,8,2),
c = c(5,4,5,5)
)
> df
# A tibble: 4 x 4
ID a b c
<chr> <dbl> <dbl> <dbl>
1 A 1 5 5
2 B 5 9 4
3 C 9 8 5
4 D 8 2 5
>
What I want is to get the rowwise minimum of columns a:c and also the column index from this minimum.
The output table should look like this:
# A tibble: 4 x 6
ID a b c Min Col_Index
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 1 5 5 1 1
2 B 5 9 4 4 3
3 C 9 8 5 5 3
4 D 8 2 5 2 2
I don't want to use rowwise()!
Thank you!
You could use pmin with do.call to get rowwise minimum and negate the values to use with max.col to get the column index of minimum.
library(dplyr)
library(purrr)
df %>%
mutate(Min = do.call(pmin, select(., a:c)),
Col_Index = max.col(-select(., a:c)))
# ID a b c Min Col_Index
# <chr> <dbl> <dbl> <dbl> <dbl> <int>
#1 A 1 5 5 1 1
#2 B 5 9 4 4 3
#3 C 9 8 5 5 3
#4 D 8 2 5 2 2
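One caveat about max.col(): by default it breaks ties at random, so a row whose minimum occurs in more than one column can give a different index on each run. A minimal variant of the approach above, assuming the first matching column is the one wanted, sets ties.method = "first":

```r
library(dplyr)

df <- tibble(
  ID = LETTERS[1:4],
  a = c(1, 5, 9, 8),
  b = c(5, 9, 8, 2),
  c = c(5, 4, 5, 5)
)

vals <- select(df, a:c)

res <- df %>%
  mutate(
    Min = do.call(pmin, vals),
    # negate so the largest value marks the minimum; ties go to the first column
    Col_Index = max.col(-as.matrix(vals), ties.method = "first")
  )
```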
Using purrr's pmap_dbl:
df %>%
mutate(Min = pmap_dbl(select(., a:c), ~min(c(...))),
Col_Index = pmap_dbl(select(., a:c), ~which.min(c(...))))
One option could be:
df %>%
rowwise() %>%
mutate(min = min(c_across(a:c)),
min_index = which.min(c_across(a:c)))
ID a b c min min_index
<chr> <dbl> <dbl> <dbl> <dbl> <int>
1 A 1 5 5 1 1
2 B 5 9 4 4 3
3 C 9 8 5 5 3
4 D 8 2 5 2 2
Base R solution:
setNames(cbind(df, t(apply(df[, vapply(df, is.numeric, logical(1))], 1, function(row) {
cbind(min(row), which.min(row))}))), c(names(df), "min", "col_index"))

Mutate multiple variable to create multiple new variables

Let's say I have a tibble where I need to take multiple variables and mutate them into multiple new variables.
As an example, here is a simple tibble:
tb <- tribble(
~x, ~y1, ~y2, ~y3, ~z,
1,2,4,6,2,
2,1,2,3,3,
3,6,4,2,1
)
I want to subtract variable z from every variable with a name starting with "y", and mutate the results as new variables of tb. Also, suppose I don't know how many "y" variables I have. I want the solution to fit nicely within tidyverse / dplyr workflow.
In essence, I don't understand how to mutate multiple variables into multiple new variables. I'm not sure if you can use mutate in this instance? I've tried mutate_if, but I don't think I'm using it right (and I get an error):
tb %>% mutate_if(starts_with("y"), funs(.-z))
#Error: No tidyselect variables were registered
Thanks in advance!
Because you are selecting by column name, you need mutate_at rather than mutate_if, which selects columns by testing the values within them:
tb %>% mutate_at(vars(starts_with("y")), funs(. - z))
#> # A tibble: 3 x 5
#> x y1 y2 y3 z
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 2 4 2
#> 2 2 -2 -1 0 3
#> 3 3 5 3 1 1
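The distinction between selecting by name and selecting by a test on the column contents can be sketched with the across() function from dplyr 1.0.0 and later. This is an illustration rather than part of the original answer: starts_with() plays the role of mutate_at(), and where() plays the role of mutate_if().

```r
library(dplyr)

tb <- tribble(
  ~x, ~y1, ~y2, ~y3, ~z,
  1, 2, 4, 6, 2,
  2, 1, 2, 3, 3,
  3, 6, 4, 2, 1
)

# Selection by name (what mutate_at does): only the y* columns change
by_name <- tb %>% mutate(across(starts_with("y"), ~ .x - z))

# Selection by predicate (what mutate_if does): every double column is doubled
by_type <- tb %>% mutate(across(where(is.double), ~ .x * 2))
```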
To create new columns instead of overwriting existing ones, we can name the function inside funs(); the name becomes a suffix:
# add suffix
tb %>% mutate_at(vars(starts_with("y")), funs(mod = . - z))
#> # A tibble: 3 x 8
#> x y1 y2 y3 z y1_mod y2_mod y3_mod
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 4 6 2 0 2 4
#> 2 2 1 2 3 3 -2 -1 0
#> 3 3 6 4 2 1 5 3 1
# remove suffix, add prefix
tb %>%
mutate_at(vars(starts_with("y")), funs(mod = . - z)) %>%
rename_at(vars(ends_with("_mod")), funs(paste("mod", gsub("_mod", "", .), sep = "_")))
#> # A tibble: 3 x 8
#> x y1 y2 y3 z mod_y1 mod_y2 mod_y3
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 4 6 2 0 2 4
#> 2 2 1 2 3 3 -2 -1 0
#> 3 3 6 4 2 1 5 3 1
Edit: As of dplyr 0.8.0, funs() is deprecated (source1 & source2); use list() with formula-style lambdas instead:
tb %>% mutate_at(vars(starts_with("y")), list(~ . - z))
#> # A tibble: 3 x 5
#> x y1 y2 y3 z
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 2 4 2
#> 2 2 -2 -1 0 3
#> 3 3 5 3 1 1
tb %>% mutate_at(vars(starts_with("y")), list(mod = ~ . - z))
#> # A tibble: 3 x 8
#> x y1 y2 y3 z y1_mod y2_mod y3_mod
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 4 6 2 0 2 4
#> 2 2 1 2 3 3 -2 -1 0
#> 3 3 6 4 2 1 5 3 1
tb %>%
mutate_at(vars(starts_with("y")), list(mod = ~ . - z)) %>%
rename_at(vars(ends_with("_mod")), list(~ paste("mod", gsub("_mod", "", .), sep = "_")))
#> # A tibble: 3 x 8
#> x y1 y2 y3 z mod_y1 mod_y2 mod_y3
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 4 6 2 0 2 4
#> 2 2 1 2 3 3 -2 -1 0
#> 3 3 6 4 2 1 5 3 1
Edit 2: dplyr 1.0.0+ has the across() function, which simplifies this task even further.
Basic usage
across() has two primary arguments:
The first argument, .cols, selects the columns you want to operate on. It uses tidy selection (like select()) so you can pick variables by position, name, and type.
The second argument, .fns, is a function or list of functions to apply to each column. This can also be a purrr style formula (or list of formulas) like ~ .x / 2. (This argument is optional, and you can omit it if you just want to get the underlying data; you'll see that technique used in vignette("rowwise").)
# Control how the names are created with the `.names` argument which
# takes a [glue](http://glue.tidyverse.org/) spec:
tb %>%
mutate(
across(starts_with("y"), ~ .x - z, .names = "mod_{col}")
)
#> # A tibble: 3 x 8
#> x y1 y2 y3 z mod_y1 mod_y2 mod_y3
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 4 6 2 0 2 4
#> 2 2 1 2 3 3 -2 -1 0
#> 3 3 6 4 2 1 5 3 1
tb %>%
mutate(
across(num_range(prefix = "y", range = 1:3), ~ .x - z, .names = "mod_{col}")
)
#> # A tibble: 3 x 8
#> x y1 y2 y3 z mod_y1 mod_y2 mod_y3
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 4 6 2 0 2 4
#> 2 2 1 2 3 3 -2 -1 0
#> 3 3 6 4 2 1 5 3 1
### Multiple functions
tb %>%
mutate(
across(c(matches("x"), contains("z")), ~ max(.x, na.rm = TRUE), .names = "max_{col}"),
across(c(y1:y3), ~ .x - z, .names = "mod_{col}")
)
#> # A tibble: 3 x 10
#> x y1 y2 y3 z max_x max_z mod_y1 mod_y2 mod_y3
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 4 6 2 3 3 0 2 4
#> 2 2 1 2 3 3 3 3 -2 -1 0
#> 3 3 6 4 2 1 3 3 5 3 1
Created on 2018-10-29 by the reprex package (v0.2.1)
