Passing multiple arguments into map() with prop.test() with dplyr - r

Could anybody help me change rowwise() + prop.test() to map2?
The issue is that map2 takes only .x and .y arguments, but I need to pass 4 columns (2 x successes and 2 x tries) into the prop.test() function. I want to extract the p-value into a new column p.
This code runs OK-ish:
library(tidyverse)
cc <- tribble(
~n1, ~x1, ~n2, ~x2,
1000,100,900,85,
1000,100,100,10,
1000,100,10,10
)
cc %>%
rowwise()%>%
mutate(p = prop.test(x=c(x1, x2),
n=c(n1, n2),
conf.level=0.95)$p.value)%>%
mutate(p=round(p,5))
#> Warning in prop.test(x = c(x1, x2), n = c(n1, n2), conf.level = 0.95): Chi-
#> squared approximation may be incorrect
#> # A tibble: 3 × 5
#> # Rowwise:
#> n1 x1 n2 x2 p
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1000 100 900 85 0.741
#> 2 1000 100 100 10 1
#> 3 1000 100 10 10 0
But I noticed that I can gain speed with a for loop:
# cc is the tibble defined above
cc$p <- 999
for (i in seq_len(nrow(cc))) {
  result <- prop.test(x = c(cc$x1[[i]], cc$x2[[i]]),
                      n = c(cc$n1[[i]], cc$n2[[i]]),
                      conf.level = 0.95)
  cc$p[[i]] <- result$p.value
}
cc %>%
  mutate(p = round(p, 5))
But I wonder whether there is a more elegant way to use the map functions from purrr, so that I can write less code and get the same results.
Thanks in advance.

There is pmap:
library(tidyverse)
cc <- tribble(
~n1, ~x1, ~n2, ~x2,
1000, 100, 900, 85,
1000, 100, 100, 10,
1000, 100, 10, 10
)
cc %>%
mutate(
signif = list(x1, x2, n1, n2) %>% pmap_dbl(~ {
prop.test(
x = c(..1, ..2),
n = c(..3, ..4),
conf.level = 0.95
)$p.value
})
)
#> Warning in prop.test(x = c(..1, ..2), n = c(..3, ..4), conf.level = 0.95): Chi-
#> squared approximation may be incorrect
#> # A tibble: 3 x 5
#> n1 x1 n2 x2 signif
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1000 100 900 85 7.41e- 1
#> 2 1000 100 100 10 1 e+ 0
#> 3 1000 100 10 10 9.46e-18
Created on 2022-02-22 by the reprex package (v2.0.0)
Explicit for loops in R are often slower than vectorised or map-style code, and they are harder to read and usually unnecessary when you just want to apply a function to each element.
What slows down your dplyr code is rowwise(), which is no longer needed once you use pmap.
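If the positional ..1, ..2 placeholders feel opaque, pmap can also match the names of the input list to the arguments of a function. The following is a minimal sketch under that assumption; p_two_prop is a hypothetical helper name introduced here for illustration, not part of any package.

```r
library(tibble)
library(dplyr)
library(purrr)

cc <- tribble(
  ~n1,  ~x1, ~n2, ~x2,
  1000, 100, 900, 85,
  1000, 100, 100, 10,
  1000, 100, 10,  10
)

# Hypothetical helper: p-value of a two-sample proportion test for one row.
# The argument names match the names of the list passed to pmap_dbl(),
# so the values are matched by name rather than by position.
p_two_prop <- function(x1, x2, n1, n2) {
  prop.test(x = c(x1, x2), n = c(n1, n2), conf.level = 0.95)$p.value
}

# prop.test() still warns about the chi-squared approximation for the
# small-sample row; that warning comes from the test, not from pmap.
res <- cc %>%
  mutate(p = pmap_dbl(list(x1 = x1, x2 = x2, n1 = n1, n2 = n2), p_two_prop))

res
```

Naming the list elements means the order of the columns in the list no longer matters, which makes it harder to swap successes and tries by accident.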

Related

Create parameterized summaries of a column

I have a tibble and I want create several summaries of the same column, specifically the first, second and third quartiles.
To do it, I create a named list of functions and that works fine.
library("tidyverse")
set.seed(1234)
df <- tibble(x = rnorm(100))
df %>%
summarise(
across(x,
list(
Q1 = ~ quantile(., 1 / 4),
Q2 = ~ quantile(., 2 / 4),
Q3 = ~ quantile(., 3 / 4)
),
.names = "{.fn}"
)
)
#> # A tibble: 1 × 3
#> Q1 Q2 Q3
#> <dbl> <dbl> <dbl>
#> 1 -0.895 -0.385 0.471
Can I achieve this by specifying the list of probabilities to pass to quantile? So that I save myself typing and more importantly avoid hard-coding the arguments to pass to the aggregating function.
The following doesn't work because it creates one row per probability rather than one column.
df %>%
summarise(
across(x, quantile, 1:3 / 4)
)
#> # A tibble: 3 × 1
#> x
#> <dbl>
#> 1 -0.895
#> 2 -0.385
#> 3 0.471
You're almost there:
df <- tibble(x = rnorm(100))
df %>%
summarise(
across(x,
map(1:3, ~partial(quantile, probs=./4)),
.names = "Q{.fn}"
)
)
#> # A tibble: 1 x 3
#> Q1 Q2 Q3
#> <dbl> <dbl> <dbl>
#> 1 -0.579 0.0815 0.475
If you define the quantiles like this:
Q <- c(0.25, 0.5, 0.75)
Then the following code will produce columns of the appropriate quantiles with sensible labels:
df %>%
summarise(
across(x,
setNames( lapply(Q,
function(x) { f <- ~quantile(., b); f[2][[1]][[3]] <- x; f }),
paste("Q", round(100 * Q), sep = "_")),
.names = "{.fn}"
)
)
#> # A tibble: 1 x 3
#> Q_25 Q_50 Q_75
#> <dbl> <dbl> <dbl>
#> 1 -0.895 -0.385 0.471
Created on 2022-06-29 by the reprex package (v2.0.1)
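Another way to avoid hard-coding the probabilities is to keep them in one named vector and build one closure per probability. This is only a sketch; the names probs and fns are assumptions made for the example.

```r
library(tibble)
library(dplyr)
library(purrr)

set.seed(1234)
df <- tibble(x = rnorm(100))

# Named vector: the names become the output column names
probs <- c(Q1 = 0.25, Q2 = 0.5, Q3 = 0.75)

# One function per probability; force(p) fixes the value of p when the
# closure is created, so each function keeps its own probability.
fns <- map(probs, function(p) {
  force(p)
  function(v) unname(quantile(v, probs = p))
})

res <- df %>%
  summarise(across(x, fns, .names = "{.fn}"))

res
```

Changing the set of quantiles then only requires editing probs.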

Using a function in R with multiple outcomes to create multiple columns in mutate

I am using a self-defined function that runs a regression analysis. I want to run it for thousands of companies over multiple years, so speed is essential. My function creates three outputs (a coefficient, the p-value and R-squared). The function runs fine individually, but when I use mutate() to run it over the whole dataset, it returns the same values for all rows. The weirdest thing is that I can't reproduce those particular values by running the function individually. I made a reproducible example below. I have used lapply successfully with this data before, but I would like to keep it in mutate, and above all I would like to know what exactly is happening here.
So my question is: how can I make this function work for each individual row of the companies dataset using mutate?
library(tidyverse)
companies <- data.frame(comp_id = 1:5)
individuals <- data.frame(id = 1:100,
comp_id = sample(1:5, 100, replace = T),
age = sample(18:67, 100, replace = T),
wage = sample(1700:10000, 100, replace = T))
regger <- function(x){
df <- individuals %>% filter(comp_id == x)
formula <- wage ~ age
regression <- lm(formula, df)
res <- list(coeff = summary(regression)$coefficient[2,1],
p = summary(regression)$coefficients[2,4],
r2 = summary(regression)$r.squared)
return(res)
}
companies %>%
mutate(data = list(regger(comp_id))) %>%
unnest_wider(data)
output:
# A tibble: 5 x 4
comp_id coeff p r2
<int> <dbl> <dbl> <dbl>
1 1 -4.92 0.916 0.000666
2 2 -4.92 0.916 0.000666
3 3 -4.92 0.916 0.000666
4 4 -4.92 0.916 0.000666
5 5 -4.92 0.916 0.000666
Use map from the purrr package if a function is not vectorized:
library(tidyverse)
set.seed(1337)
companies <- data.frame(comp_id = 1:5)
individuals <- data.frame(
id = 1:100,
comp_id = sample(1:5, 100, replace = T),
age = sample(18:67, 100, replace = T),
wage = sample(1700:10000, 100, replace = T)
)
regger <- function(x) {
df <- individuals %>% filter(comp_id == x)
formula <- wage ~ age
regression <- lm(formula, df)
res <- list(
coeff = summary(regression)$coefficient[2, 1],
p = summary(regression)$coefficients[2, 4],
r2 = summary(regression)$r.squared
)
return(res)
}
companies %>%
mutate(data = comp_id %>% map(regger)) %>%
unnest_wider(data)
#> # A tibble: 5 x 4
#> comp_id coeff p r2
#> <int> <dbl> <dbl> <dbl>
#> 1 1 67.1 0.108 0.218
#> 2 2 23.7 0.466 0.0337
#> 3 3 31.2 0.292 0.0462
#> 4 4 18.4 0.582 0.0134
#> 5 5 0.407 0.994 0.00000371
Created on 2021-09-09 by the reprex package (v2.0.1)
I'm not sure what the output should look like, but could it be that you need to work on a row-by-row basis?
companies %>%
rowwise() %>%
mutate(data = list(regger(comp_id))) %>%
unnest_wider(data)
comp_id coeff p r2
<int> <dbl> <dbl> <dbl>
1 1 21.6 0.470 0.0264
2 2 13.5 0.782 0.00390
3 3 0.593 0.984 0.0000175
4 4 -9.33 0.824 0.00394
5 5 64.9 0.145 0.156
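As for what happens in the original call: without rowwise() or map(), mutate() calls regger() once with the entire comp_id column, so inside filter() the comparison comp_id == x compares a long vector with a short one and R recycles the short one. The resulting single list is then repeated for every row. The sketch below shows the recycling in base R with made-up toy values (ind_comp_id stands in for individuals$comp_id).

```r
# Toy stand-in for individuals$comp_id (hypothetical values)
ind_comp_id <- c(1, 2, 3, 1, 2, 3)

# What regger() receives inside a plain mutate(): the whole column
x <- 1:3

# x is recycled to 1, 2, 3, 1, 2, 3 and compared element by element,
# so rows are kept based on position, not on a single company
whole_vector <- ind_comp_id == x

# What was intended: compare against one company at a time
one_company <- ind_comp_id == 1

which(whole_vector)
which(one_company)
```

This would also explain why the repeated values cannot be reproduced by calling regger(1), regger(2) and so on: the row subset selected by the recycled comparison belongs to no single company.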

Speed up the applications of the function "cummean" on 4 vectors

I have 2 vectors x and y of the same length, and x^2 and y^2 are their element-wise squares. In each iteration, I need to apply the function cummean to x, y, x^2 and y^2.
I would like to ask whether I can speed this up somehow rather than running 4 separate operations.
library(dplyr)
x <- c(1, 2, 3)
y <- c(5, 5, 6)
dplyr::cummean(x)
dplyr::cummean(y)
dplyr::cummean(x^2)
dplyr::cummean(y^2)
Thank you so much for your suggestion!
I guess you could do something like:
tibble(x, y) %>%
mutate(across(1:2, ~.x^2, .names = c("{col}^2"))) %>%
mutate(across(1:4, cummean, .names = "cummean_{col}"))
#> # A tibble: 3 x 8
#> x y `x^2` `y^2` cummean_x cummean_y `cummean_x^2` `cummean_y^2`
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 5 1 25 1 5 1 25
#> 2 2 5 4 25 1 5 1 25
#> 3 3 6 9 36 1.33 5 2 25
And if you want the variables in the global environment rather than in a tibble you could do:
tibble(x, y) %>%
mutate(across(1:2, ~.x^2, .names = c("{col}^2"))) %>%
mutate(across(1:4, cummean, .names = "cummean_{col}")) %>%
as.list() %>%
list2env(envir = globalenv())
Or in a function if you had to do this a lot you could do:
func <- function(x, y)
{
tibble(x, y) %>%
mutate(across(1:2, ~.x^2, .names = c("{col}^2"))) %>%
mutate(across(1:4, cummean, .names = "cummean_{col}")) %>%
as.list() %>%
list2env(envir = parent.frame())
}
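For comparison, the cumulative mean can also be written as cumsum(v) / seq_along(v) and applied to all four columns of a matrix in one call. This is a base R sketch, not a benchmark; the matrix name m and the column names x2 and y2 are assumptions made for the example. Note that by this definition the cumulative mean of x = c(1, 2, 3) is 1, 1.5, 2, which differs from the cummean_x column printed above; that output may come from a dplyr release with an indexing bug in cummean().

```r
x <- c(1, 2, 3)
y <- c(5, 5, 6)

# Put the four series side by side as matrix columns
m <- cbind(x = x, y = y, x2 = x^2, y2 = y^2)

# Cumulative mean of each column: running sum divided by running count
cm <- apply(m, 2, function(v) cumsum(v) / seq_along(v))

cm
```

Whether this is actually faster than four separate cummean() calls depends on the vector lengths, so it is worth timing on real data.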

using functionals with variable obtained from local environment in R (a lexical scoping issue)

I'm working on a classification-type problem using some simulations, and I want to count the number of true positives, false positives, etc. using different thresholds.
Consider the following example:
library(tidyverse)
set.seed(23)
n <- 100
df <- tibble(
class = sample(LETTERS[1:5], 100, replace = TRUE),
pred_class = sample(LETTERS[1:5], 100, replace = TRUE),
correct = class == pred_class,
pval = runif(100, 0, 1)
) %>%
print()
#> # A tibble: 100 x 4
#> class pred_class correct pval
#> <chr> <chr> <lgl> <dbl>
#> 1 C E FALSE 0.643
#> 2 B C FALSE 0.561
#> 3 B C FALSE 0.824
#> 4 D A FALSE 0.971
#> 5 E A FALSE 0.0283
#> 6 C D FALSE 0.723
#> 7 E D FALSE 0.521
#> 8 E D FALSE 0.619
#> 9 E E TRUE 0.198
#> 10 E B FALSE 0.386
#> # ... with 90 more rows
For a fixed cut-off, the task is trivial (please ignore the direction of the assignments, they are correct for the actual task I'm working on, but I do recognize they might appear backwards here). This is what I'm trying to accomplish, but with more than 1 cutoff:
df %>%
summarize(
cutoff = 0.05,
TP = sum(!correct & pval < 0.05),
FP = sum(correct & pval < 0.05),
FN = sum(!correct & pval >= 0.05),
TN = sum(correct & pval >= 0.05)
)
#> # A tibble: 1 x 5
#> cutoff TP FP FN TN
#> <dbl> <int> <int> <int> <int>
#> 1 0.05 5 1 73 21
But for multiple cutoffs, say a <- c(0.01, 0.05, 0.1) or a <- seq(0, .15, 0.01), this is a lot of cut and paste.
So my goal is to figure out how to do this with functionals and (I think?) summarize_at.
Unfortunately, this is giving me issues.
I can get this to work when the sums are based on single variables. It's ugly, but the following works:
# define the functionals (note only 2 since we are only looking at 1 variable)
alpha <- seq(0.01, 0.05, 0.01) # the code below refers to the cutoffs as alpha; these values match the output shown
pfun <- list(
less_p = function(a) {function(p) sum(p < a)},
more_p = function(a) {function(p) sum(p >= a)}
) %>%
imap(~list(f = .x, label = .y))
fun_list <- cross(list(alpha = alpha, f = pfun)) %>% map(function(x) {
list(
f = x$f$f(x$alpha),
label = paste(x$f$label, x$alpha, sep = "_")
)
}) %>%
set_names(., map_chr(., ~ .x$label)) %>%
map(~ .x$f)
df %>%
summarize_at(
.vars = vars(pval),
.funs = funs(!!!fun_list)
)
#> # A tibble: 1 x 10
#> less_p_0.01 less_p_0.02 less_p_0.03 less_p_0.04 less_p_0.05 more_p_0.01
#> <int> <int> <int> <int> <int> <int>
#> 1 1 3 4 4 6 99
#> # ... with 4 more variables: more_p_0.02 <int>, more_p_0.03 <int>,
#> # more_p_0.04 <int>, more_p_0.05 <int>
Some gather, separate and spread fun, and this will be in the desired format.
However, when we write the functionals using the variable correct too, it breaks because correct isn't found:
afun <- list(
TP_fun = function(a) { function(p) sum(!correct & p < a)},
FP_fun = function(a) { function(p) sum( correct & p < a)},
FN_fun = function(a) { function(p) sum(!correct & p >= a)},
TN_fun = function(a) { function(p) sum( correct & p >= a)}
) %>%
imap(~list(f = .x, label = .y))
# all combinations of alpha and the functions
fun_list <- cross(list(alpha = alpha, f = afun)) %>% map(function(x) {
list(
f = x$f$f(x$alpha),
label = paste(x$f$label, x$alpha, sep = "_")
)
}) %>%
set_names(., map_chr(., ~ .x$label)) %>%
map(~ .x$f)
df %>%
summarize_at(
.vars = vars(pval),
.funs = funs(!!!fun_list)
)
#> Error in summarise_impl(.data, dots): Evaluation error: object 'correct' not found.
I tried replacing correct in the functionals with .$correct, but this does not solve the problem. What is the best way to reference additional variables from within the functional?
As an aside - I feel like there should be a simpler solution to this problem. If I'm over-complicating a simple problem, please feel free to
Created on 2019-01-30 by the reprex package (v0.2.1)
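One possible simpler route, sketched here under the assumption that looping over the cutoffs is acceptable, is to wrap the single-cutoff summarize() in a function and bind the results. Because the sums are evaluated inside summarize(), correct and pval are looked up in the data frame, while the cutoff a comes from the enclosing function. count_at is a hypothetical helper name.

```r
library(dplyr)

set.seed(23)
df <- tibble(
  class      = sample(LETTERS[1:5], 100, replace = TRUE),
  pred_class = sample(LETTERS[1:5], 100, replace = TRUE),
  correct    = class == pred_class,
  pval       = runif(100, 0, 1)
)

cutoffs <- c(0.01, 0.05, 0.1)

# Hypothetical helper: the four counts for one cutoff,
# using the same definitions as the single-cutoff example
count_at <- function(a) {
  df %>%
    summarize(
      cutoff = a,
      TP = sum(!correct & pval <  a),
      FP = sum( correct & pval <  a),
      FN = sum(!correct & pval >= a),
      TN = sum( correct & pval >= a)
    )
}

# One row per cutoff
res <- bind_rows(lapply(cutoffs, count_at))

res
```

This yields one row per cutoff directly, so no gather, separate and spread step is needed afterwards.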

Tidyverse syntax for calculating precision and recall

I am trying to calculate AUC, precision, recall and accuracy for every group in my data frame (I have a single data frame that contains the predictions from three different models, concatenated).
What is the tidyverse syntax to do it? I want to use the yardstick package by Max Kuhn to calculate these metrics.
Here is a sample df and here is where I got so far:
> library(tidyverse)
> library(yardstick)
>
> sample_df <- data_frame(
+ group_type = rep(c('a', 'b', 'c'), each = 5), # repeats each element 5 times
+ true_label = as.factor(rbinom(15, 1, 0.3)), # generates 1 with 30% prob
+ pred_prob = runif(15, 0, 1) # generates 15 decimals between 0 and 1 from uniform dist
+ ) %>%
+ mutate(pred_label = as.factor(if_else(pred_prob > 0.5, 1, 0)))
>
> sample_df
# A tibble: 15 x 4
group_type true_label pred_prob pred_label
<chr> <fct> <dbl> <fct>
1 a 1 0.327 0
2 a 1 0.286 0
3 a 0 0.0662 0
4 a 0 0.993 1
5 a 0 0.835 1
6 b 0 0.975 1
7 b 0 0.436 0
8 b 0 0.585 1
9 b 0 0.478 0
10 b 1 0.541 1
11 c 1 0.247 0
12 c 0 0.608 1
13 c 0 0.215 0
14 c 0 0.937 1
15 c 0 0.819 1
>
Metrics:
> # metrics for the full data
> precision(sample_df, truth = true_label, estimate = pred_label)
[1] 0.5714286
> recall(sample_df, truth = true_label, estimate = pred_label)
[1] 0.3636364
> accuracy(sample_df, truth = true_label, estimate = pred_label)
[1] 0.3333333
> roc_auc(sample_df, truth = true_label, pred_prob)
[1] 0.7727273
>
Now how do I get these metrics for each group in my dataset?
sample_df %>%
group_by(group_type) %>%
summarize(???)
An example using unnest:
sample_df %>%
group_by(group_type) %>%
do(auc = roc_auc(., true_label, pred_prob),
acc = accuracy(., true_label, pred_label),
recall = recall(., true_label, pred_label),
precision = precision(., true_label, pred_label)) %>% unnest
HOWEVER,
I would actually suggest not using yardstick, because it doesn't play nicely with dplyr's summarize. Actually, it just uses the ROCR package under the hood. I would just write your own functions that take in two variables.
yardstick is flawed because it requires a data.frame as its first input; it is trying to be too clever. Under the dplyr framework that isn't necessary, because summarize and mutate already see the variables inside a data.frame without an explicit data parameter.
As others have noted, the functions in yardstick don't really play nice with grouped data frames (at least as of yet). A workaround could be to work with nested data.
In order to reduce replication, it's probably also a good idea to write a simple wrapper function that calculates all of the summary metrics you want in one call. Here's an example of how you could go about doing just that:
reprex::reprex_info()
#> Created by the reprex package v0.1.1.9000 on 2018-02-09
Set up first:
library(tidyverse)
library(yardstick)
set.seed(1)
# Given sample data
sample_df <- data_frame(
group_type = rep(c('a', 'b', 'c'), each = 5), # repeats each element 5 times
true_label = as.factor(rbinom(15, 1, 0.3)), # generates 1 with 30% prob
pred_prob = runif(15, 0, 1) # generates 15 decimals between 0 and 1 from uniform dist
) %>%
mutate(pred_label = as.factor(if_else(pred_prob > 0.5, 1, 0)))
#> Warning: package 'bindrcpp' was built under R version 3.3.3
Here are the wrappers:
# Wrapper to calculate several metrics from same data
performance_metrics <- function(data, truth, estimate, prob) {
metrics <- lst(precision, recall, accuracy) # these all share arguments
values <- invoke_map_df(metrics, list(list(data)), truth, estimate)
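# Caution: this call uses the global sample_df rather than data, which is why
# roc_auc is identical for every group in the output below; pass data for a per-group AUC
roc <- roc_auc(sample_df, truth, prob) # bit different here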
bind_cols(values, roc_auc = roc)
}
# Wrap the wrapper with default arguments
metrics <- partial(performance_metrics,
truth = "true_label",
estimate = "pred_label",
prob = "pred_prob")
And apply to groups by nesting the data:
sample_df %>%
nest(-group_type) %>%
mutate(metrics = map(data, metrics)) %>%
unnest(metrics)
#> # A tibble: 3 x 6
#> group_type data precision recall accuracy roc_auc
#> <chr> <list> <dbl> <dbl> <dbl> <dbl>
#> 1 a <tibble [5 x 3]> 0.5000000 0.2500000 0.2 0.5909091
#> 2 b <tibble [5 x 3]> 0.6666667 0.6666667 0.6 0.5909091
#> 3 c <tibble [5 x 3]> 0.7500000 0.7500000 0.6 0.5909091
I managed to do it by splitting the data frame into a list and mapping the function over each list element:
library(tidyverse)
library(yardstick)
sample_df %>%
split(.$group_type) %>%
map_dfr(precision, true_label, pred_label)
#output
## A tibble: 1 x 3
a b c
<dbl> <dbl> <dbl>
1 0.500 0.667 1.00
It seems group_by is not supported by yardstick functions yet.
This also works:
sample_df %>%
split(.$group_type) %>%
map_dfr(function(x){
prec = precision(x, true_label, pred_label)
rec = recall(x, true_label, pred_label)
return(data.frame(prec, rec))
})
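Later yardstick releases accept grouped data frames directly, so the splitting step may no longer be needed. The sketch below assumes a reasonably recent yardstick version is installed; class_metrics is a name chosen for this example, and the factor levels are set explicitly so that truth and estimate always share the same levels.

```r
library(dplyr)
library(yardstick)

set.seed(1)
sample_df <- tibble(
  group_type = rep(c("a", "b", "c"), each = 5),
  true_label = factor(rbinom(15, 1, 0.3), levels = c(0, 1)),
  pred_prob  = runif(15, 0, 1)
) %>%
  mutate(pred_label = factor(if_else(pred_prob > 0.5, 1, 0), levels = c(0, 1)))

# Bundle several class metrics into one function
class_metrics <- metric_set(precision, recall, accuracy)

# One row per group and metric, in long format
res <- sample_df %>%
  group_by(group_type) %>%
  class_metrics(truth = true_label, estimate = pred_label)

res
```

By default yardstick treats the first factor level (here 0) as the event of interest, so the precision and recall values refer to that level unless the event level is changed.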
I used the example in http://r4ds.had.co.nz/many-models.html
It uses nest but also uses precision as you requested.
library(tidyverse)
library(yardstick)
sample_df <- data_frame(group_type = rep(c('a', 'b', 'c'), each = 5), # repeats each element 5 times
true_label = as.factor(rbinom(15, 1, 0.3)), # generates 1 with 30% prob
pred_prob = runif(15, 0, 1) # generates 15 decimals between 0 and 1 from uniform dist
) %>%
mutate(pred_label = as.factor(if_else(pred_prob > 0.5, 1, 0)))
by_group_type <- sample_df %>% group_by(group_type) %>% nest()
stick_m_1 <- function(df){
precision(df,truth = true_label, estimate = pred_label)
}
models <- map(by_group_type$data,stick_m_1)
models
