R dplyr, distinct, unique combination of variables, with maximum value of third

I'm close but don't have the syntax correct. I'm trying to keep all columns of a data table while selecting one row per unique combination of two variables (columns): the row with the maximum value of a third. MWE of progress thus far. Thx. J
library(dplyr)
dt1 <- tibble (var1 = c("num1", "num2", "num3", "num4", "num5"),
var2 = rep("A", 5),
var3 = c(rep("B", 2), rep("C", 3)),
var4 = c(5, 10, 3, 7, 19))
dt1 %>% distinct(var2, var3, max(var4), .keep_all = TRUE)
# A tibble: 2 x 5
var1 var2 var3 var4 `max(var4)`
<chr> <chr> <chr> <dbl> <dbl>
1 num1 A B 5 19
2 num3 A C 3 19
which is close, but I want the row where the value of var4 is the max value, within the unique combination of var2 and var3. I'm attempting to get:
# A tibble: 2 x 5
var1 var2 var3 var4 `max(var4)`
<chr> <chr> <chr> <dbl> <dbl>
1 num2 A B 5 10
2 num5 A C 3 19
Do I need a formula for the third argument of the distinct function?

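We can add an arrange statement before the distinct. distinct() keeps the first row of each var2/var3 combination, so sorting var4 in descending order first makes that row the one with the maximum.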
library(dplyr)
dt1 %>%
arrange(var2, var3, desc(var4)) %>%
distinct(var2, var3, .keep_all = TRUE)
-output
# A tibble: 2 x 4
var1 var2 var3 var4
<chr> <chr> <chr> <dbl>
1 num2 A B 10
2 num5 A C 19
Or another option is slice_max
dt1 %>%
group_by(var2, var3) %>%
mutate(var4new = first(var4)) %>%
slice_max(order_by= var4, n = 1) %>%
ungroup
-output
# A tibble: 2 x 5
var1 var2 var3 var4 var4new
<chr> <chr> <chr> <dbl> <dbl>
1 num2 A B 10 5
2 num5 A C 19 3

slice() will do what you want, though you have to drop the var4 values 5 and 3 shown in your expected output (not really sure if that is important).
tibble (var1 = c("num1", "num2", "num3", "num4", "num5"),
var2 = rep("A", 5),
var3 = c(rep("B", 2), rep("C", 3)),
var4 = c(5, 10, 3, 7, 19)) %>%
group_by(var2, var3) %>%
slice(which.max(var4)) %>%
ungroup()
# A tibble: 2 x 4
var1 var2 var3 var4
<chr> <chr> <chr> <dbl>
1 num2 A B 10
2 num5 A C 19

Does this work:
library(dplyr)
dt1 %>% group_by(var2, var3) %>% filter(dense_rank(desc(var4)) == 1)
# A tibble: 2 x 4
# Groups: var2, var3 [2]
var1 var2 var3 var4
<chr> <chr> <chr> <dbl>
1 num2 A B 10
2 num5 A C 19
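One point the answers above do not spell out is what happens when a group has tied maxima. The sketch below uses a small made-up tibble (`ties` is hypothetical, not the asker's data) to show that slice_max() keeps every tied row by default and a single row with with_ties = FALSE. filter(dense_rank(desc(var4)) == 1) behaves like the first case, and slice(which.max(var4)) like the second.

```r
library(dplyr)

# Hypothetical data: two rows in the same group share the maximum var4
ties <- tibble(var1 = c("num1", "num2", "num3"),
               var2 = "A",
               var3 = "B",
               var4 = c(10, 10, 3))

# Default: all rows tied for the maximum are kept
keep_all_ties <- ties %>%
  group_by(var2, var3) %>%
  slice_max(var4, n = 1) %>%
  ungroup()

# with_ties = FALSE: exactly one row per group
keep_one <- ties %>%
  group_by(var2, var3) %>%
  slice_max(var4, n = 1, with_ties = FALSE) %>%
  ungroup()
```

Which behaviour is right depends on whether duplicate maxima should survive, so it is worth choosing deliberately.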

Related

Why does dplyr's coalesce(.) and fill(.) not work and still leave missing values?

I have a simple test dataset that has many repeating rows for participants. I want one row per participant that doesn't have NAs, unless the participant has NAs for the entire column. I tried grouping by participant name and then using coalesce(.) and fill(.), but it still leaves missing values. Here's my test dataset:
library(dplyr)
library(tibble)
test_dataset <- tibble(name = rep(c("Justin", "Corey", "Sibley"), 4),
var1 = c(rep(c(NA), 10), 2, 3),
var2 = c(rep(c(NA), 9), 2, 4, 6),
var3 = c(10, 15, 7, rep(c(NA), 9)),
outcome = c(3, 9, 23, rep(c(NA), 9)),
tenure = rep(c(10, 15, 20), 4))
And here's what I get when I use coalesce(.) or fill(., direction = "downup"), which both produce the same result.
library(dplyr)
library(tibble)
test_dataset_coalesced <- test_dataset %>%
group_by(name) %>%
coalesce(.) %>%
slice_head(n=1) %>%
ungroup()
test_dataset_filled <- test_dataset %>%
group_by(name) %>%
fill(., .direction="downup") %>%
slice_head(n=1) %>%
ungroup()
And here's what I want--note, there is one NA because that participant only has NA for that column:
library(tibble)
correct <- tibble(name = c("Justin", "Corey", "Sibley"),
var1 = c(NA, 2, 3),
var2 = c(2, 4, 6),
var3 = c(10, 15, 7),
outcome = c(3, 9, 23),
tenure = c(10, 15, 20))

You can group_by the name column, then fill the NA (you need to fill every column using everything()) with the non-NA values within the group, then only keep the distinct rows.
library(tidyverse)
test_dataset %>%
group_by(name) %>%
fill(everything(), .direction = "downup") %>%
distinct()
# A tibble: 3 × 6
# Groups: name [3]
name var1 var2 var3 outcome tenure
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Justin NA 2 10 3 10
2 Corey 2 4 15 9 15
3 Sibley 3 6 7 23 20
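As an alternative to fill() plus distinct(), the same one-row-per-participant result can probably be reached by summarising each column to its first non-missing value. This is only a sketch: the indexing idiom `.x[!is.na(.x)][1]` is my own choice, not something from the answer, and it returns NA when a participant has no non-missing value in a column. Note that summarise() orders the rows by name rather than by first appearance.

```r
library(dplyr)
library(tibble)

test_dataset <- tibble(name = rep(c("Justin", "Corey", "Sibley"), 4),
                       var1 = c(rep(c(NA), 10), 2, 3),
                       var2 = c(rep(c(NA), 9), 2, 4, 6),
                       var3 = c(10, 15, 7, rep(c(NA), 9)),
                       outcome = c(3, 9, 23, rep(c(NA), 9)),
                       tenure = rep(c(10, 15, 20), 4))

# For every non-grouping column, take the first non-NA value per participant;
# indexing an empty vector with [1] yields NA, which covers the all-NA case
collapsed <- test_dataset %>%
  group_by(name) %>%
  summarise(across(everything(), ~ .x[!is.na(.x)][1]), .groups = "drop")
```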

Try this
cleaned<- test_dataset |>
dplyr::group_by(name) |>
tidyr::fill(everything(),.direction = "downup") |>
unique()
# Drop rows in which every column other than name is NA
cleaned[rowSums(is.na(cleaned[,-1]))<ncol(cleaned[,-1]),]
name var1 var2 var3 outcome tenure
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Justin NA 2 10 3 10
2 Corey 2 4 15 9 15
3 Sibley 3 6 7 23 20
Group when values in two columns are identical and calculate the mean

I have a dataset like this:
df1 <- data.frame(
var1 = c(1, 1, 1, 2),
var2 = c(1, 2, 2, 1),
value = c(1, 2, 3, 4))
I want to group rows by var1 and var2 and calculate the mean of value. The condition is that rows with the same var1 and var2 values are grouped together (so it is not simply grouped by unique values in var1 and var2).
The output dataset will be this:
df2 <- data.frame(
var1 = c(1, 1, 2),
var2 = c(1, 2, 1),
value = c(1, 2.5, 4))
How can I do this?

Using aggregate().
aggregate(value ~ var1 + var2, df1, mean)
# var1 var2 value
# 1 1 1 1.0
# 2 2 1 4.0
# 3 1 2 2.5

You may try
library(dplyr)
df1 %>%
group_by(var1, var2) %>%
summarise(value = mean(value))
var1 var2 value
<dbl> <dbl> <dbl>
1 1 1 1
2 1 2 2.5
3 2 1 4
group_by(var1, var2) groups by both variables together, so each distinct (var1, var2) pair forms one group.
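A small variation, offered as a sketch rather than a correction: summarise() on two grouping variables leaves the result grouped by var1 and prints a message about it. Passing `.groups = "drop"` returns a plain ungrouped tibble, which matches the shape of the desired df2 more closely.

```r
library(dplyr)

df1 <- data.frame(
  var1 = c(1, 1, 1, 2),
  var2 = c(1, 2, 2, 1),
  value = c(1, 2, 3, 4))

# One row per (var1, var2) pair, with the grouping dropped afterwards
df2 <- df1 %>%
  group_by(var1, var2) %>%
  summarise(value = mean(value), .groups = "drop")
```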

conditionally summarize several variables by group

I want to conditionally summarize several variables by group. The following code does that, but I'm not sure how to do this without specifying each variable and the conditions in the summarize step.
library(tidyverse)
dat <- data.frame(group = c("A", "A", "A", "B", "B", "B"),
indicator = c(1, 2, 3, 1, 2, 3),
var1 = c(1, 0, 1, 2, 1, 2),
var2 = c(1, 0, 1, 1, 2, 1))
# dat
# group indicator var1 var2
#1 A 1 1 1
#2 A 2 0 0
#3 A 3 1 1
#4 B 1 2 1
#5 B 2 1 2
#6 B 3 2 1
dat %>%
group_by(group) %>%
summarise(var1 = sum(var1[indicator==1 | indicator==2]),
var2 = sum(var2[indicator==1 | indicator==2]))
# A tibble: 2 x 3
# group var1 var2
#* <chr> <dbl> <dbl>
#1 A 1 1
#2 B 3 3

Use across:
library(dplyr)
dat %>%
group_by(group) %>%
summarise(across(starts_with('var'), ~sum(.[indicator %in% 1:2])))
# group var1 var2
#* <chr> <dbl> <dbl>
#1 A 1 1
#2 B 3 3

filter infinite values and NAs in same call using dplyr::c_across and filter_if

I'm looking to filter dataframe rows with Inf and NA in the same call using filter with c_across and deprecated filter_if:
library(dplyr)
df <- tibble(a = c(1, 2, 3, NA, 1), b = c(5, Inf, 8, 8, 3), c = c(9, 10, Inf, 11, 12), d = c('a', 'b', 'c', 'd', 'e'), e = c(1, 2, 3, 4, -Inf))
# # A tibble: 5 x 5
# a b c d e
# <dbl> <dbl> <dbl> <chr> <dbl>
# 1 1 5 9 a 1
# 2 2 Inf 10 b 2
# 3 3 8 Inf c 3
# 4 NA 8 11 d 4
# 5 1 3 12 e -Inf
I could do this in two calls using either c_across or filter_if:
df %>%
rowwise %>%
filter(!any(is.infinite(c_across(where(is.numeric))))) %>%
filter(!any(is.na(c_across(where(is.numeric)))))
# # A tibble: 1 x 5
# # Rowwise:
# a b c d e
# <dbl> <dbl> <dbl> <chr> <dbl>
# 1 1 5 9 a 1
#OR filter_if:
df %>%
filter_if(~is.numeric(.), all_vars(!is.infinite(.))) %>%
filter_if(~is.numeric(.), all_vars(!is.na(.)))
# # A tibble: 1 x 5
# a b c d e
# <dbl> <dbl> <dbl> <chr> <dbl>
# 1 1 5 9 a 1
How would I do both approaches in one call to filter (and filter_if)? There may be an across approach too?
thanks

Try this. Use where() to identify your numeric columns.
df %>%
filter(across(.cols = where(is.numeric),
.fns = ~!is.infinite(.x) & !is.na(.x)))

I would suggest an approach with across() from dplyr:
library(dplyr)
#Data
df <- tibble(a = c(1, 2, 3, NA, 1),
b = c(5, Inf, 8, 8, 3),
c = c(9, 10, Inf, 11, 12),
d = c('a', 'b', 'c', 'd', 'e'),
e = c(1, 2, 3, 4, -Inf))
#Filter
df %>% filter(across(c(a:e), ~ !is.na(.) & !is.infinite(.)))
Output:
# A tibble: 1 x 5
a b c d e
<dbl> <dbl> <dbl> <chr> <dbl>
1 1 5 9 a 1
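Both answers above put across() inside filter(). Newer dplyr releases (1.0.4 and later, as far as I know) provide if_all() and if_any() for this purpose, so a hedged alternative is sketched below. It also relies on base R's is.finite(), which is FALSE for NA, NaN, Inf and -Inf, so a single predicate covers both conditions from the question.

```r
library(dplyr)

df <- tibble(a = c(1, 2, 3, NA, 1),
             b = c(5, Inf, 8, 8, 3),
             c = c(9, 10, Inf, 11, 12),
             d = c('a', 'b', 'c', 'd', 'e'),
             e = c(1, 2, 3, 4, -Inf))

# Keep rows where every numeric column holds a finite, non-missing value
clean <- df %>%
  filter(if_all(where(is.numeric), is.finite))
```

Only the numeric columns are tested, so the character column d is carried along untouched.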

Converting Uneven List to Dataframe

I am wondering if there is an elegant and generalizable way to convert mylist to mydf in the example below. I've looked at the rectangling vignette but the examples for unnest/hoist seem to be on lists that have a regular but not tidy structure.
mylist <- list(name = "example",
idnum = 123,
cases = list(
case1 = list(
type = 1,
genre = "A"),
case2 = list(
type = 1,
genre = "B"),
case3 = list(
type = 2,
genre = "A"
)))
mydf <- data.frame(name = rep("example", 3),
idnum = rep(123, 3),
cases = c("case1", "case2", "case3"),
type = c(1, 1, 2),
genre = c("A", "B", "A"))
Edit: This gets close to what I want but I am losing the case names
mylist %>%
as_tibble %>%
unnest_wider(cases)
# A tibble: 3 x 4
name idnum type genre
<chr> <dbl> <dbl> <chr>
1 example 123 1 A
2 example 123 1 B
3 example 123 2 A

One option is bind_rows with unnest_wider
library(dplyr)
library(tidyr)
bind_rows(mylist) %>%
mutate(case = names(cases)) %>%
unnest_wider(c(cases))
# A tibble: 3 x 5
# name idnum type genre case
# <chr> <dbl> <dbl> <chr> <chr>
#1 example 123 1 A case1
#2 example 123 1 B case2
#3 example 123 2 A case3
Or, as @Ben suggested in the comments:
mylist %>%
as_tibble %>%
mutate(case = names(cases)) %>%
unnest_wider(c(cases))
# A tibble: 3 x 5
# name idnum type genre case
# <chr> <dbl> <dbl> <chr> <chr>
#1 example 123 1 A case1
#2 example 123 1 B case2
#3 example 123 2 A case3
