forcats::fct_lump_prop() behavior

I have been struggling with the {forcats} fct_lump_prop() function, more specifically with its w = and prop = arguments. What exactly are they supposed to represent in the example below?
df <- tibble(var1 = c(rep("a", 3), rep("b", 3)),
var2 = c(0.2, 0.3, 0.2, 0.1, 0.1, 0.1))
# A tibble: 6 x 2
var1 var2
<chr> <dbl>
1 a 0.2
2 a 0.3
3 a 0.2
4 b 0.1
5 b 0.1
6 b 0.1
What prop value should I set so that a is kept intact, but b is lumped into Others?
# A tibble: 6 x 2
var1 var2 fct
<chr> <dbl> <fct>
1 a 0.2 a
2 a 0.3 a
3 a 0.2 a
4 b 0.1 Others
5 b 0.1 Others
6 b 0.1 Others
I have tried empirically but can't figure out how this works. Intuitively I would say that setting a prop value above 0.3 (the sum of the b weights) and below 0.7 (the sum of the a weights) would lump the b level, but this doesn't lump anything. The only threshold I found was prop = 0.7, and then everything becomes the Other level...
df %>%
mutate(fct = fct_lump_prop(var1, w = var2, prop = 0.7))
# A tibble: 6 x 3
var1 var2 fct
<fct> <dbl> <fct>
1 a 0.2 Other
2 a 0.3 Other
3 a 0.2 Other
4 b 0.1 Other
5 b 0.1 Other
6 b 0.1 Other
So what am I not understanding? The small examples I found elsewhere did not help me grasp the prop and w behavior.
Thanks a lot.

The relevant explanation is in this line of the source of fct_lump_prop():
if (prop > 0 && sum(prop_n <= prop) <= 1) {
return(f) # No lumping needed
}
That is, if only one level would end up in "Other", the function does nothing.
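Applying that check to the question's data makes this concrete (a quick sketch of my own, not from the original answer): a's weighted share is 0.7 and b's is 0.3, so for any prop strictly between them only b falls at or below the threshold, sum(prop_n <= prop) is 1, and the factor comes back unchanged.
library(forcats)
library(dplyr)
df <- tibble(var1 = c(rep("a", 3), rep("b", 3)),
             var2 = c(0.2, 0.3, 0.2, 0.1, 0.1, 0.1))
# weighted share per level: a = 0.7, b = 0.3
df %>% count(var1, wt = var2)
# only b is <= prop, so nothing gets lumped and var1 is returned as-is
df %>% mutate(fct = fct_lump_prop(factor(var1), w = var2, prop = 0.5))
The effect of prop is easier to see once there are at least three levels: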
library(forcats)
library(dplyr)
df <- tibble(var1 = c(rep("a", 3), rep("b", 3),rep("c", 3)),
var2 = (1:9)/(45))
df %>% group_by(var1) %>% summarise(sum(var2))
#> # A tibble: 3 × 2
#> var1 `sum(var2)`
#> <chr> <dbl>
#> 1 a 0.133
#> 2 b 0.333
#> 3 c 0.533
df %>%
mutate(fct = fct_lump_prop(factor(var1), w = var2, prop = 0.45))
#> # A tibble: 9 × 3
#> var1 var2 fct
#> <chr> <dbl> <fct>
#> 1 a 0.0222 Other
#> 2 a 0.0444 Other
#> 3 a 0.0667 Other
#> 4 b 0.0889 Other
#> 5 b 0.111 Other
#> 6 b 0.133 Other
#> 7 c 0.156 c
#> 8 c 0.178 c
#> 9 c 0.2 c
df %>%
mutate(fct = fct_lump_prop(factor(var1), w = var2, prop = 0.25))
#> # A tibble: 9 × 3
#> var1 var2 fct
#> <chr> <dbl> <fct>
#> 1 a 0.0222 a
#> 2 a 0.0444 a
#> 3 a 0.0667 a
#> 4 b 0.0889 b
#> 5 b 0.111 b
#> 6 b 0.133 b
#> 7 c 0.156 c
#> 8 c 0.178 c
#> 9 c 0.2 c
Created on 2022-11-14 with reprex v2.0.2
It doesn't appear to be documented anywhere except in the source code itself.
Edit
There is also an issue on GitHub because the man page says that levels are lumped if they appear "fewer than" prop times, but they are also lumped if they appear "exactly" prop times.

Why does `scale` create columns with `[,1]` at the end when used with mutate + across in dplyr?

See the code below.
The mutate(across(everything(), scale, .names = "{.col}_z")) part of the syntax generates columns with [,1] appended at the end.
Two questions:
Why is this happening?
How can I avoid or remove it?
library(dplyr)
# Input
df_test <- tibble(x = c(1, 2, 3, 4), y = c(5, 6, 7, 8))
# My code generating x_z and y_z
df_scaled <- df_test %>%
mutate(across(everything(), scale, .names = "{.col}_z"))
# Output
df_scaled
#> # A tibble: 4 × 4
#> x y x_z[,1] y_z[,1]
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 5 -1.16 -1.16
#> 2 2 6 -0.387 -0.387
#> 3 3 7 0.387 0.387
#> 4 4 8 1.16 1.16
Expected output
#> # A tibble: 4 × 4
#> x y x_z y_z
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 5 -1.16 -1.16
#> 2 2 6 -0.387 -0.387
#> 3 3 7 0.387 0.387
#> 4 4 8 1.16 1.16
Created on 2022-12-30 with reprex v2.0.2
scale() returns a matrix. We can either wrap the result in c(), extract the column with [, or use as.numeric() to drop the dim attributes.
library(dplyr)
df_test %>%
mutate(across(everything(),
~ as.numeric(scale(.x)), .names = "{.col}_z"))
-output
# A tibble: 4 × 4
x y x_z y_z
<dbl> <dbl> <dbl> <dbl>
1 1 5 -1.16 -1.16
2 2 6 -0.387 -0.387
3 3 7 0.387 0.387
4 4 8 1.16 1.16
i.e. check the output on a single column
> scale(df_test[[1]])
[,1]
[1,] -1.1618950
[2,] -0.3872983
[3,] 0.3872983
[4,] 1.1618950
attr(,"scaled:center")
[1] 2.5
attr(,"scaled:scale")
[1] 1.290994
If we check the source code
> scale.default
function (x, center = TRUE, scale = TRUE)
{
x <- as.matrix(x) # it is converting to matrix
...
The matrix form is needed for the apply/colMeans/sweep calls inside scale(), so when we pass a vector to scale(), it gets converted to a single-column matrix:
> as.matrix(df_test$x)
[,1]
[1,] 1
[2,] 2
[3,] 3
[4,] 4
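The other two options mentioned above (extracting the column with [ or wrapping in c()) work the same way; a quick sketch:
library(dplyr)
# both drop the matrix structure and return plain numeric vectors
df_test %>%
  mutate(across(everything(), ~ scale(.x)[, 1], .names = "{.col}_z"))
df_test %>%
  mutate(across(everything(), ~ c(scale(.x)), .names = "{.col}_z"))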
We could also write our own scaling function, my_scale(), here built on caret::preProcess():
my_scale <- function(x){
  require(caret)
  # center and scale a single column via caret, returning a plain vector
  var <- preProcess(data.frame(x), c("center", "scale"))
  predict(var, data.frame(x))$x
}
df_test %>%
  mutate(across(everything(), ~ my_scale(.), .names = "{.col}_z"))
x y x_z y_z
<dbl> <dbl> <dbl> <dbl>
1 1 5 -1.16 -1.16
2 2 6 -0.387 -0.387
3 3 7 0.387 0.387
4 4 8 1.16 1.16
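A dependency-free variant (a sketch that avoids caret altogether) does the centering and scaling by hand and already returns a plain numeric vector:
my_scale2 <- function(x) {
  # same result as scale() for a complete numeric vector, but without the dim attribute
  (x - mean(x)) / sd(x)
}
df_test %>%
  mutate(across(everything(), my_scale2, .names = "{.col}_z"))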

How can I calculate the correlation between each variable within each group in R using the dplyr package?

Let's say I have a data frame in R that looks like this:
var2 = c(rep("A",3),rep("B",3),rep("C",3),rep("D",3),rep("E",3),rep("F",3),
rep("H",3),rep("I",3))
y2 = c(-1.23, -0.983, 1.28, -0.268, -0.46, -1.23,
1.87, 0.416, -1.99, 0.289, 1.7, -0.455,
-0.648, 0.376, -0.887,0.534,-0.679,-0.923,
0.987,0.324,-0.783,-0.679,0.326,0.998);length(y2)
group2 = c(rep(1,6),rep(2,6),rep(3,6),rep(1,6))
data2 = tibble(var2,group2,y2)
with output :
# A tibble: 24 × 3
var2 group2 y2
<chr> <dbl> <dbl>
1 A 1 -1.23
2 A 1 -0.983
3 A 1 1.28
4 B 1 -0.268
5 B 1 -0.46
6 B 1 -1.23
7 C 2 1.87
8 C 2 0.416
9 C 2 -1.99
10 D 2 0.289
11 D 2 1.7
12 D 2 -0.455
13 E 3 -0.648
14 E 3 0.376
15 E 3 -0.887
16 F 3 0.534
17 F 3 -0.679
18 F 3 -0.923
19 H 1 0.987
20 H 1 0.324
21 H 1 -0.783
22 I 1 -0.679
23 I 1 0.326
24 I 1 0.998
I want to calculate the correlation of each distinct pair within each group using dplyr.
Ideally the resulting tibble would look like this, with the 4th column containing the value of each correlation pair:
group var1 var2 value
1     A    B    cor(A,B)
1     A    H    cor(A,H)
1     A    I    cor(A,I)
1     B    H    cor(B,H)
1     B    I    cor(B,I)
1     H    I    cor(H,I)
2     C    D    cor(C,D)
3     E    F    cor(E,F)
How can I do that in R? Any help would be appreciated.
A possible solution:
library(tidyverse)
data2 %>%
  group_by(group2) %>%
  group_split() %>%
  map(\(x) x %>%
        group_by(var2) %>%
        group_map(~ data.frame(.x[-1]) %>% set_names(.y)) %>%
        bind_cols() %>%
        cor() %>%
        {data.frame(row = rownames(.)[row(.)[upper.tri(.)]],
                    col = colnames(.)[col(.)[upper.tri(.)]],
                    corr = .[upper.tri(.)])}) %>%
  imap_dfr(~ data.frame(group = .y, .x))
#> group row col corr
#> 1 1 A B -0.9949738
#> 2 1 A H -0.9581357
#> 3 1 B H 0.9819901
#> 4 1 A I 0.8533855
#> 5 1 B I -0.9012948
#> 6 1 H I -0.9669093
#> 7 2 C D 0.4690460
#> 8 3 E F -0.1864518
If you are okay with calling combn() twice (once for the pair names, once for the correlations), you can do:
fun <- function(x, y){
  a <- split(x, y)                                    # one vector of y2 values per level of var2
  col1 <- combn(names(a), 2, paste, collapse = '_')   # pair labels, e.g. "A_B"
  col2 <- combn(unname(a), 2, do.call, what = 'cor')  # cor() applied to each pair of vectors
  data.frame(vars = col1, cor = col2)
}
data2 %>%
  group_by(group2) %>%
  summarise(fun(y2, var2), .groups = 'drop')
# A tibble: 8 x 3
# Groups: group2 [3]
group2 vars cor
<dbl> <chr> <dbl>
1 1 A_B -0.995
2 1 A_H -0.958
3 1 A_I 0.853
4 1 B_H 0.982
5 1 B_I -0.901
6 1 H_I -0.967
7 2 C_D 0.469
8 3 E_F -0.186
If you do not want to repeat the combn() call, as the process might be expensive, you can do:
data2 %>%
  group_by(group2) %>%
  summarise(s = combn(split(y2, var2), 2,
                      \(x) stack(setNames(cor(x[[1]], x[[2]]),
                                          paste(names(x), collapse = '_'))),
                      simplify = FALSE),
            .groups = 'drop') %>%
  unnest(s)
# A tibble: 8 x 3
group2 values ind
<dbl> <dbl> <fct>
1 1 -0.995 A_B
2 1 -0.958 A_H
3 1 0.853 A_I
4 1 0.982 B_H
5 1 -0.901 B_I
6 1 -0.967 H_I
7 2 0.469 C_D
8 3 -0.186 E_F
Another option would be widyr::pairwise_cor(), which requires first adding an identifier for the "observation":
library(widyr)
library(dplyr)
data2 %>%
  group_by(var2, group2) %>%
  mutate(obs = row_number()) %>%
  ungroup() %>%
  split(.$group2) %>%
  lapply(function(x) widyr::pairwise_cor(x, var2, obs, y2, upper = FALSE)) %>%
  bind_rows(.id = "group2")
#> # A tibble: 8 × 4
#> group2 item1 item2 correlation
#> <chr> <chr> <chr> <dbl>
#> 1 1 A B -0.995
#> 2 1 A H -0.958
#> 3 1 B H 0.982
#> 4 1 A I 0.853
#> 5 1 B I -0.901
#> 6 1 H I -0.967
#> 7 2 C D 0.469
#> 8 3 E F -0.186
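Just to make explicit what these pairwise values are (a quick sanity check, not part of any answer above): each level contributes its three y2 values as a vector, and the correlation is computed between two such vectors. For the group-1 pair (A, B):
a <- c(-1.23, -0.983, 1.28)   # y2 values where var2 == "A"
b <- c(-0.268, -0.46, -1.23)  # y2 values where var2 == "B"
cor(a, b)
#> [1] -0.9949738
which matches the A/B entry in the outputs above.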

dplyr `slice_max` interpolation not working

Given a data.frame:
library(tidyverse)
set.seed(0)
df <- tibble(A = 1:10, B = rnorm(10), C = rbinom(10,2,0.6))
var <- "B"
I'd like to filter the data frame down to the highest values of the variable named in var. Logically, I'd try either:
df %>%
slice_max({{ var }}, n = 5)
#> # A tibble: 1 × 3
#> A B C
#> <int> <dbl> <int>
#> 1 1 1.26 1
df %>%
slice_max(!! var, n = 5)
#> # A tibble: 1 × 3
#> A B C
#> <int> <dbl> <int>
#> 1 1 1.26 1
But neither interpolation is working... what am I missing here?
Expected output would be the same as:
df %>%
slice_max(B, n = 5)
#> # A tibble: 5 × 3
#> A B C
#> <int> <dbl> <int>
#> 1 10 2.40 0
#> 2 3 1.33 2
#> 3 4 1.27 1
#> 4 1 1.26 1
#> 5 5 0.415 2
I think you need to use the newer .data pronoun, as outlined here:
df %>%
slice_max(.data[[var]] , n = 5)
#> # A tibble: 5 × 3
#> A B C
#> <int> <dbl> <int>
#> 1 10 2.40 0
#> 2 3 1.33 2
#> 3 4 1.27 1
#> 4 1 1.26 1
#> 5 5 0.415 2
I am puzzled by why your approach gets only the first row, though!
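A plausible reason (an assumption on my part, not verified against the dplyr internals): with {{ var }} or !! var, the thing being ordered by is just the string "B", a length-1 character vector, not the column B, so there is only one element to order over and a single row comes back. The actual value of the string shouldn't matter:
df %>%
  slice_max("anything at all", n = 5)
# presumably also returns a single row, for the same reason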
We may convert the string to a symbol with rlang::sym() and evaluate it with !!:
library(dplyr)
df %>%
slice_max(!! rlang::sym(var), n = 5)
-output
# A tibble: 5 × 3
A B C
<int> <dbl> <int>
1 10 2.40 0
2 3 1.33 2
3 4 1.27 1
4 1 1.26 1
5 5 0.415 2

Making data wide where row names are not identical

I am trying to get a model results table into wide format. Since variable names are not identical across outcomes (dv variables), NAs show up in the table and I can't find a way to get one row per variable.
I need one row per variable/dv. Models 1 and 3 share all variables except one.
Data:
table <- data.frame(
  variable = c("intercept", "a", "b", "intercept", "c", "intercept", "a", "e", "intercept", "c"),
  b = c(1.2, 0.1, 0.4, 0.3, 0.9, 1.3, 2, 0.23, 0.4, 0.7),
  p = abs(rnorm(10, 0, 0.3)),
  model = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4),
  dv = c(rep("dv1", 5), rep("dv2", 5))
)
> table
variable b p model dv
1 intercept 1.20 0.03320481 1 dv1
2 a 0.10 0.16675234 1 dv1
3 b 0.40 0.53607394 1 dv1
4 intercept 0.30 0.14935514 2 dv1
5 c 0.90 0.58998515 2 dv1
6 intercept 1.30 0.21040677 3 dv2
7 a 2.00 0.14183742 3 dv2
8 e 0.23 0.32034711 3 dv2
9 intercept 0.40 0.06539247 4 dv2
10 c 0.70 0.30780133 4 dv2
Code:
table %>%
  gather(key, value, b, p) %>%
  unite("stat_var", dv, key, sep = ".") %>%
  spread(stat_var, value) %>%
  arrange(model, desc(variable))
Output:
variable model dv1.b dv1.p dv2.b dv2.p
1 intercept 1 1.2 0.21866737 NA NA
2 b 1 0.4 0.50600799 NA NA
3 a 1 0.1 0.18751178 NA NA
4 intercept 2 0.3 0.25133611 NA NA
5 c 2 0.9 0.04601194 NA NA
6 intercept 3 NA NA 1.30 0.34144108
7 e 3 NA NA 0.23 0.12793927
8 a 3 NA NA 2.00 0.37614448
9 intercept 4 NA NA 0.40 0.08852144
10 c 4 NA NA 0.70 0.26853770
Looking for:
I will set aside the reasoning (I don't see a valid reason) for treating some models as "equivalent"; I handle it here with mutate(). Focusing only on the table manipulation, here is a basic option to get your desired output:
You can use pivot_wider() to turn the names b and p into suffixes.
require(tidyverse)
table %>%
  mutate(model = case_when(model == 3 ~ 1,
                           model == 4 ~ 2,
                           TRUE ~ model)) %>%
  pivot_wider(names_from = dv, values_from = c("b", "p")) %>%
  select(variable, model, ends_with("dv1"), ends_with("dv2"))
# A tibble: 6 x 6
# variable model b_dv1 p_dv1 b_dv2 p_dv2
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 intercept 1 1.2 0.318 1.3 0.200
# 2 a 1 0.1 0.120 2 0.419
# 3 b 1 0.4 0.309 NA NA
# 4 intercept 2 0.3 0.350 0.4 0.0148
# 5 c 2 0.9 0.185 0.7 0.530
# 6 e 1 NA NA 0.23 0.174
As I stated in my comment, it seems that your expected output is wrong.
However, you can reproduce it by tweaking the model variable:
table %>%
  select(model, everything()) %>%
  mutate(model = ifelse(model > 2, model - 2, model)) %>%
  pivot_longer(c(b, p)) %>%
  unite("name", c("dv", "name")) %>%
  pivot_wider()
# # A tibble: 6 x 6
# model variable dv1_b dv1_p dv2_b dv2_p
# <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 intercept 1.2 0.193 1.3 0.160
# 2 1 a 0.1 0.650 2 0.476
# 3 1 b 0.4 0.190 NA NA
# 4 2 intercept 0.3 0.0435 0.4 0.145
# 5 2 c 0.9 0.372 0.7 0.243
# 6 1 e NA NA 0.23 0.297
Of note, gather() and spread() have been superseded by the pivoting functions, which offer very nice improvements (not all of which are used here).
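For reference, the question's gather()/spread() pipeline maps onto the pivoting functions roughly like this (a sketch that keeps the NA-padded layout, since the model variable is not tweaked here):
table %>%
  pivot_longer(c(b, p), names_to = "key") %>%
  unite("stat_var", dv, key, sep = ".") %>%
  pivot_wider(names_from = stat_var, values_from = value) %>%
  arrange(model, desc(variable))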

Tidy replacement of missing values in conjunction with using a predicate function

What is the recommended tidy way of replacing NAs in conjunction with using a
predicate function?
I was hoping to leverage tidyr::replace_na() (or a similar predefined missing value handler) in some way, but I can't seem to get it to work with either a purrr or dplyr way of using predicate functions.
library(magrittr)
# Example data:
df <- tibble::tibble(
id = c(rep("A", 3), rep("B", 3)),
x = c(1, 2, NA, 10, NA, 30),
y = c("a", NA, "c", NA, NA, "f")
)
# Works, but needs manual spec of columns that should be handled:
df %>%
tidyr::replace_na(list(x = 0))
#> # A tibble: 6 x 3
#> id x y
#> <chr> <dbl> <chr>
#> 1 A 1 a
#> 2 A 2 <NA>
#> 3 A 0 c
#> 4 B 10 <NA>
#> 5 B 0 <NA>
#> 6 B 30 f
# Doesn't work (at least not in the intended way):
df %>%
dplyr::mutate_if(
function(.x) inherits(.x, c("integer", "numeric")),
~tidyr::replace_na(0)
)
#> # A tibble: 6 x 3
#> id x y
#> <chr> <dbl> <chr>
#> 1 A 0 a
#> 2 A 0 <NA>
#> 3 A 0 c
#> 4 B 0 <NA>
#> 5 B 0 <NA>
#> 6 B 0 f
# Works, but uses an inline def of the replacement function:
df %>%
dplyr::mutate_if(
function(.x) inherits(.x, c("integer", "numeric")),
function(.x) dplyr::if_else(is.na(.x), 0, .x)
)
#> # A tibble: 6 x 3
#> id x y
#> <chr> <dbl> <chr>
#> 1 A 1 a
#> 2 A 2 <NA>
#> 3 A 0 c
#> 4 B 10 <NA>
#> 5 B 0 <NA>
#> 6 B 30 f
# Works, but uses an inline def of the replacement function:
df %>%
purrr::modify_if(
function(.x) inherits(.x, c("integer", "numeric")),
function(.x) dplyr::if_else(is.na(.x), 0, .x)
)
#> # A tibble: 6 x 3
#> id x y
#> <chr> <dbl> <chr>
#> 1 A 1 a
#> 2 A 2 <NA>
#> 3 A 0 c
#> 4 B 10 <NA>
#> 5 B 0 <NA>
#> 6 B 30 f
Created on 2019-01-21 by the reprex package (v0.2.1)
If we are using ~, then we need to specify the . as well (otherwise replace_na(0) treats 0 as its data argument, returns 0, and the recycled scalar overwrites the whole column), i.e.
df %>%
mutate_if(function(.x) inherits(.x, c("integer", "numeric")),
~ replace_na(., 0))
# A tibble: 6 x 3
# id x y
# <chr> <dbl> <chr>
#1 A 1 a
#2 A 2 <NA>
#3 A 0 c
#4 B 10 <NA>
#5 B 0 <NA>
#6 B 30 f
otherwise, just do
df %>%
mutate_if(function(.x) inherits(.x, c("integer", "numeric")),
replace_na, replace = 0)
# A tibble: 6 x 3
# id x y
# <chr> <dbl> <chr>
#1 A 1 a
#2 A 2 <NA>
#3 A 0 c
#4 B 10 <NA>
#5 B 0 <NA>
#6 B 30 f
Or another variation is
df %>%
mutate_if(funs(inherits(., c("integer", "numeric"))),
~ replace_na(., 0))
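For what it's worth, mutate_if() and funs() have since been superseded/deprecated; with current dplyr the same idea can be written with across() and where() (a sketch):
library(dplyr)
library(tidyr)
df %>%
  mutate(across(where(is.numeric), ~ replace_na(.x, 0)))
# where(is.numeric) covers both integer and double columns, like the inherits() predicate above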
