Assign most common value of a factor variable with summarize in R

R noob here, working in tidyverse / RStudio.
I have a categorical / factor variable that I'd like to retain in a group_by/summarize workflow. I'd like to summarize it using a summary function that returns the most common value of that factor within each group.
Is there a summary function I can use for this?
mean returns NA, median only works with numeric data, and summary gives me separate rows with counts of each factor level instead of the most common level.
Edit: example using subset of mtcars dataset:
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
21 6 160 110 3.9 2.62 16.5 0 1 4 4
21 6 160 110 3.9 2.88 17.0 0 1 4 4
22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
24.4 4 147. 62 3.69 3.19 20 1 0 4 2
22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
Here I have converted carb into a factor variable. In this subset of the data, you can see that among 6-cylinder cars there are 3 with carb=4 and 2 with carb=1; similarly, among 4-cylinder cars there are 2 with carb=2 and 1 with carb=1.
So if I do:
data %>% group_by(cyl) %>% summarise(modalcarb = FUNC(carb))
where FUNC is the function I'm looking for, I should get:
cyl carb
<dbl> <fct>
4 2
6 4
8 2 # there are multiple potential ways of handling multi-modal situations, but that's secondary here
Hope that makes sense!

You could use the fmode() function from the collapse package to calculate the mode. Here I created a reproducible example using the mtcars dataset, where the cyl column is the factor variable to group on:
library(dplyr)
library(collapse)
mtcars %>%
mutate(cyl = as.factor(cyl)) %>%
group_by(cyl) %>%
summarise(mode = fmode(am))
#> # A tibble: 3 × 2
#> cyl mode
#> <fct> <dbl>
#> 1 4 1
#> 2 6 0
#> 3 8 0
Created on 2022-11-24 with reprex v2.0.2
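Applied to the question's actual target (the modal carb within each cyl group), the same pattern works directly, since fmode() returns the most frequent value of any vector type, factors included. A minimal sketch:

```r
library(dplyr)
library(collapse)

# Most common carb level within each cyl group
mtcars %>%
  mutate(carb = as.factor(carb)) %>%
  group_by(cyl) %>%
  summarise(modalcarb = fmode(carb))
#> # A tibble: 3 x 2
#>     cyl modalcarb
#>   <dbl> <fct>
#> 1     4 2
#> 2     6 4
#> 3     8 4
```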

We could use which.max after count:
library(dplyr)
# fake dataset
x <- mtcars %>%
mutate(cyl = factor(cyl)) %>%
select(cyl)
x %>%
count(cyl) %>%
slice(which.max(n))
cyl n
<fct> <int>
1 8 14
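To get a per-group mode with the same count-based idea, slice_max() (available since dplyr 1.0) can replace which.max. A sketch, assuming ties should resolve to a single arbitrary level:

```r
library(dplyr)

# Per-group mode: count level frequencies, then keep the top row per group
mtcars %>%
  mutate(carb = factor(carb)) %>%
  count(cyl, carb) %>%
  group_by(cyl) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%  # one row per group on ties
  select(cyl, modalcarb = carb)
```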

You can use table to count and names() with which.max to pick the most common level. (Indexing carb directly with which.max(table(carb)) would be wrong: which.max returns a position in the table of levels, not a position in carb.)
library(tidyverse)
mtcars |>
group_by(cyl) |>
summarise(modalcarb = as.numeric(names(which.max(table(carb)))))
#> # A tibble: 3 x 2
#> cyl modalcarb
#> <dbl> <dbl>
#> 1 4 2
#> 2 6 4
#> 3 8 4
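If this comes up often, a small reusable helper keeps the pipeline readable. The name stat_mode below is hypothetical; the body is a common base-R mode idiom that works for any vector type, including factors:

```r
library(dplyr)

# Most common value of x; ties go to the value that appears first
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

mtcars %>%
  mutate(carb = factor(carb)) %>%
  group_by(cyl) %>%
  summarise(modalcarb = stat_mode(carb))
```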

How to filter inside only certain groups (that satisfy a particular condition) in grouped tibble, using dplyr?

Using the mtcars dataset as an example, I would like to:
group table based on the number of cylinders
within each group test whether any car has miles per gallon higher than 25 ( mpg > 25)
for only those groups that have at least one car with mpg > 25, I would like to remove the cars that have mpg < 20
The expected output: cars that belong to a cylinder group in which at least one car has mpg > 25, and that themselves have mpg < 20, are removed from the dataset.
PS: I can think of several ways to address this problem, but I wanted to see if someone could come up with a straightforward and elegant solution, e.g.
xx <- split(mtcars, f = mtcars$cyl)
for (i in seq_along(xx)) {
if (any(xx[[i]]$mpg > 25)) xx[[i]] <- filter(xx[[i]], mpg > 20)
}
xx <- bind_rows(xx)
Maybe this ?
library(dplyr)
mtcars %>%
group_by(cyl) %>%
filter(if(any(mpg > 25)) mpg > 20 else TRUE) %>%
ungroup
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
# 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
# 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
# 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
# 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
# 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
# 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
# 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# … with 22 more rows
From the groups that have at least one mpg value above 25, we keep only the rows with mpg above 20; if a group has no value above 25, we keep all of its rows.
We can also collapse this into a single condition. Note that filtering on any(mpg > 25) & mpg > 20 would drop every row of the groups that never exceed 25 mpg, so negate the group condition to keep those groups intact:
library(dplyr)
mtcars %>%
group_by(cyl) %>%
filter(!any(mpg > 25) | mpg > 20) %>%
ungroup
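The same conditional-group filter can be sketched in base R with ave(), which evaluates any(mpg > 25) within each cyl group and recycles the result across the group's rows:

```r
# Keep a row if its group never exceeds 25 mpg, or if its own mpg > 20
keep <- with(mtcars, !ave(mpg > 25, cyl, FUN = any) | mpg > 20)
result <- mtcars[keep, ]
```

With mtcars nothing actually gets dropped: the 4-cylinder group is the only one exceeding 25 mpg, and all of its cars already have mpg > 20.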

step mutate fixed value to a list of variables in tidymodels

I wonder if it's possible to mutate variables inside my recipe, taking a list of variables and imputing a fixed value (-12345) wherever NA is found.
No success so far.
my_list <- c("impute1", "impute2", "impute3")
recipe <-
recipes::recipe(target ~ ., data = data_train) %>%
recipes::step_naomit(everything(), skip = TRUE) %>%
recipes::step_rm(c(v1, v2, id, id2 )) %>%
recipes::step_mutate_at(my_list, if_else(is.na(.), -12345, . ))
Error in step_mutate_at_new(terms = ellipse_check(...), fn = fn, trained = trained, :
argument "fn" is missing, with no default
You were on the right track, with a couple of notes. To make recipes::step_mutate_at() work you need two things: a selection of variables to be transformed, and one or more functions to apply to that selection. The functions should be passed to the fn argument, either as a function (named or anonymous) or as a named list of functions.
Setting fn = ~if_else(is.na(.), -12345, .) in step_mutate_at() should fix your problem, using the ~fun(.) lambda style. Furthermore, I used all_of(my_list) instead of my_list to avoid ambiguous selection when referencing an external vector.
Lastly, note that step_naomit() removes the observations with missing values during baking, which might be undesirable since you are imputing the missing values.
library(recipes)
mtcars1 <- mtcars
mtcars1[1, 1:3] <- NA
my_list <- c("mpg", "cyl", "disp")
recipe <-
recipe(drat ~ ., data = mtcars1) %>%
step_mutate_at(all_of(my_list), fn = ~if_else(is.na(.), -12345, . ))
recipe %>%
prep() %>%
bake(new_data = NULL)
#> # A tibble: 32 x 11
#> mpg cyl disp hp wt qsec vs am gear carb drat
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -12345 -12345 -12345 110 2.62 16.5 0 1 4 4 3.9
#> 2 21 6 160 110 2.88 17.0 0 1 4 4 3.9
#> 3 22.8 4 108 93 2.32 18.6 1 1 4 1 3.85
#> 4 21.4 6 258 110 3.22 19.4 1 0 3 1 3.08
#> 5 18.7 8 360 175 3.44 17.0 0 0 3 2 3.15
#> 6 18.1 6 225 105 3.46 20.2 1 0 3 1 2.76
#> 7 14.3 8 360 245 3.57 15.8 0 0 3 4 3.21
#> 8 24.4 4 147. 62 3.19 20 1 0 4 2 3.69
#> 9 22.8 4 141. 95 3.15 22.9 1 0 4 2 3.92
#> 10 19.2 6 168. 123 3.44 18.3 1 0 4 4 3.92
#> # … with 22 more rows
Created on 2021-06-21 by the reprex package (v2.0.0)

How do I specify a range of columns in a case_when statement to check a condition in R?

Given the tibble -
library(tidyverse)
df <- mtcars %>% as_tibble() %>% slice(1:5)
df
# A tibble: 5 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
I know you can use something like df %>% select(c(mpg:vs)) to select a range of columns without typing all the column names in the select statement. How do I do something similar in a case_when statement. I have a dataset with about 35 columns and want to flag rows where all columns are equal to 0.
We can use a where() condition inside select():
df %>%
select(where(~ all(. %in% c(0, 1))))
-output
# A tibble: 5 x 2
vs am
<dbl> <dbl>
1 0 1
2 0 1
3 1 1
4 1 0
5 0 0
If we want to create a new column 'flag' that checks whether all the column values are 0 for a particular row:
df %>%
mutate(new = !rowSums(cur_data() != 0))
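Note that cur_data() is deprecated as of dplyr 1.1.0; the same rowSums() trick can be written with pick() instead. A sketch, assuming a recent dplyr:

```r
library(dplyr)

df <- mtcars %>% as_tibble() %>% slice(1:5)

# flag is TRUE only when every value in the row is 0
df %>% mutate(flag = rowSums(pick(everything()) != 0) == 0)
```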
I'm not sure whether case_when can be used for this problem, but here is another solution:
library(dplyr)
df %>%
rowwise() %>%
mutate(flag = +(all(c_across(where(is.numeric)) == 0)))
# A tibble: 5 x 12
# Rowwise:
mpg cyl disp hp drat wt qsec vs am gear carb flag
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 0
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 0
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 0
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 0
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 0
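For referencing a range of columns inside a row-wise condition, if_all() (added in dplyr 1.0.4) is arguably the most direct answer, and it composes with case_when(). A sketch:

```r
library(dplyr)

df <- mtcars %>% as_tibble() %>% slice(1:5)

df %>%
  mutate(flag = case_when(
    if_all(mpg:carb, ~ .x == 0) ~ 1,  # every column in the range is 0
    TRUE ~ 0
  ))
```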

How to effectively append datasets using a dictionary (with R/dplyr)? / How to coalesce 'all columns with duplicate names'?

I have a series of data sets and a dictionary to bring these together. But I'm struggling to figure out how to automate this.
Suppose this data and dictionary (actual one is much longer, thus I want to automate):
mtcarsA <- mtcars[1:5,] %>% rename(mpgA = mpg, cyl_A = cyl) %>% as_tibble()
mtcarsB <- mtcars[6:10,] %>% rename(mpg_B = mpg, B_cyl = cyl) %>% as_tibble()
dic <- tibble(true_name = c("mpg_true", "cyl_true"),
nameA = c("mpgA", "cyl_A"),
nameB = c("mpg_B", "B_cyl")
)
I want these datasets (from years A and B) appended to one another, and then to have the names changed or coalesced to the 'true_name' values.
I can bring the data sets together into mtcars_all, and then I tried recoding the column names with the dictionary as follows
mtcars_all <- bind_rows(mtcarsA, mtcarsB)
recode_colname <- function(df, tn=dic$true_name, fname){
colnames(df) <- dplyr::recode(colnames(df),
!!!setNames(as.character(tn), fname))
return(df)
}
mtcars_all <- mtcars_all %>%
recode_colname(fname=dic$nameA) %>%
recode_colname(fname=dic$nameB)
But then I get duplicate columns. Of course I could coalesce each of these duplicate columns by name, but there will be many of these in my real case, so I want to automate 'coalesce all columns with duplicate names'.
I'm giving the entire problem here because perhaps someone also has a better solution for 'using a data dictionary'.
You can create a named vector to replace column names.
library(tidyverse)
pmap(dic, ~setNames(..1, paste0(c(..2, ..3), collapse = '|'))) %>%
flatten_chr() -> val
val
# mpgA|mpg_B cyl_A|B_cyl
# "mpg_true" "cyl_true"
Then apply it to the list of data frames and combine them.
list(mtcarsA,mtcarsB) %>%
map_df(function(x) x %>% rename_with(~str_replace_all(.x, val)))
# mpg_true cyl_true disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
# 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
# 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
# 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
# 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
# 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
# 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
# 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
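For the 'coalesce all columns with duplicate names' part, a generic helper can be sketched. The name coalesce_dups is hypothetical; it assumes each set of same-named columns has compatible types, and note that split() reorders the columns alphabetically by name:

```r
library(dplyr)
library(purrr)

# Collapse every set of identically named columns into a single column
coalesce_dups <- function(df) {
  split.default(df, names(df)) %>%   # group columns by name
    map(~ reduce(.x, coalesce)) %>%  # coalesce within each name group
    as_tibble()
}

# Example with a duplicated "mpg_true" column
dup <- data.frame(mpg_true = c(21, NA), mpg_true = c(NA, 22.8),
                  cyl_true = c(6, 4), check.names = FALSE)
coalesce_dups(dup)
```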

With dplyr and group_by, is there a way to reference the original (full) dataset? [duplicate]

This question already has answers here:
Assign intermediate output to temp variable as part of dplyr pipeline
(6 answers)
Closed 3 years ago.
QUESTION: Is there a way to reference the original dataset OR (preferably) the dataset from the chain, right before the group_by() at all?
nrow(mtcars)
32 (but we all knew that)
> mtcars %>% group_by(cyl) %>% summarise(count = n())
# A tibble: 3 x 2
cyl count
<dbl> <int>
1 4 11
2 6 7
3 8 14
Great.
mtcars %>%
group_by(cyl) %>%
summarise(count = n(),
prop = n()/SOMETHING)
I understand I could put nrow(mtcars) in there, but this is just an MRE; that's not an option in a more complex chain of operations.
Edit: I oversimplified the MRE. I am aware of the "." but I actually wanted to be able to pass the interim tibble off to another function (within the call to summarise), so the assign solution below does exactly what I was after. Thanks.
We can use add_count to count the rows per group and append the count as a new column to the original data frame. If we need a more complex operation, we can continue with mutate from there.
library(dplyr)
library(tidyr)
mtcars %>%
group_by(cyl) %>%
add_count()
# # A tibble: 32 x 12
# # Groups: cyl [3]
# mpg cyl disp hp drat wt qsec vs am gear carb n
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 7
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 7
# 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 11
# 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 7
# 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 14
# 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 7
# 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 14
# 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 11
# 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 11
# 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 7
# # ... with 22 more rows
You are after the ".":
mtcars %>%
group_by(cyl) %>%
summarise(count = n(),
prop = n()/nrow(.)) %>%
ungroup()
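Another way to keep the pre-group_by() row count around, without relying on the pipe placeholder, is to record it as a column before grouping. A sketch:

```r
library(dplyr)

mtcars %>%
  mutate(total = n()) %>%  # row count of the full data, taken before grouping
  group_by(cyl) %>%
  summarise(count = n(),
            prop = n() / first(total))
```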
