Speed up the applications of the function "cummean" on 4 vectors - r

I have two vectors x and y of the same length, and x^2 and y^2 are their element-wise squares. In each iteration, I need to apply the function cummean to x, y, x^2, and y^2.
I would like to know whether I can speed this up somehow rather than running four separate operations.
library(dplyr)
x <- c(1, 2, 3)
y <- c(5, 5, 6)
dplyr::cummean(x)
dplyr::cummean(y)
dplyr::cummean(x^2)
dplyr::cummean(y^2)
Thank you so much for your suggestion!

I guess you could do something like:
tibble(x, y) %>%
  mutate(across(1:2, ~.x^2, .names = c("{col}^2"))) %>%
  mutate(across(1:4, cummean, .names = "cummean_{col}"))
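#> # A tibble: 3 x 8
#>       x     y `x^2` `y^2` cummean_x cummean_y `cummean_x^2` `cummean_y^2`
#>   <dbl> <dbl> <dbl> <dbl>     <dbl>     <dbl>         <dbl>         <dbl>
#> 1     1     5     1    25       1        5             1             25
#> 2     2     5     4    25       1.5      5             2.5           25
#> 3     3     6     9    36       2        5.33          4.67          28.7
(Some older dplyr releases had an off-by-one bug in cummean() and print 1, 1, 1.33 for cummean_x instead.)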
And if you want the variables in the global environment rather than in a tibble, you could do:
tibble(x, y) %>%
  mutate(across(1:2, ~.x^2, .names = c("{col}^2"))) %>%
  mutate(across(1:4, cummean, .names = "cummean_{col}")) %>%
  as.list() %>%
  list2env(envir = globalenv())
Or, if you had to do this a lot, you could wrap it in a function:
func <- function(x, y)
{
  tibble(x, y) %>%
    mutate(across(1:2, ~.x^2, .names = c("{col}^2"))) %>%
    mutate(across(1:4, cummean, .names = "cummean_{col}")) %>%
    as.list() %>%
    list2env(envir = parent.frame())
}
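If raw speed is the goal, one option is to skip dplyr for this step: a cumulative mean is just a cumulative sum divided by the running count, so all four series can be computed in one pass over a matrix. The following is a minimal base R sketch, not a benchmarked solution; the names m and cm are made up for illustration, and whether it is actually faster on your data is something to measure rather than assume.

```r
x <- c(1, 2, 3)
y <- c(5, 5, 6)

# One column per series: x, y, x^2, y^2
m <- cbind(x = x, y = y, x2 = x^2, y2 = y^2)

# Cumulative sum down each column, then divide row i by i.
# The divisor recycles down the columns, so every element in row i is divided by i.
cm <- apply(m, 2, cumsum) / seq_len(nrow(m))
cm
```

Each column of cm is the cumulative mean of the corresponding series. Note that apply() returns a plain vector rather than a matrix when the input vectors have length 1, so that case would need separate handling.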

Related

how to filter on column==var when var has same name as column? (inside pmap)

I have a tibble that I want to filter by comparing its columns against some variables. However, it's convenient for that variable to have the same name as the column. How can I force dplyr to evaluate the variable so it doesn't confuse the variable and column names?
set.seed(2)
ngrp <- 3
npergrp <- 4
tib <- tibble(grp=rep(letters[1:ngrp], each=npergrp),
N=rep(1:npergrp, ngrp),
val=round(runif(npergrp*ngrp))) %>% print(n=Inf)
grp <- grp_ <- 'a'
tib %>% dplyr::filter(grp==grp_) %>% glimpse() ## works
tib %>% dplyr::filter(grp==grp) %>% glimpse() ## undesired result, grp==grp always true
tib %>% dplyr::filter(grp=={{grp}}) %>% glimpse() ## hey it works!
## slightly less toy example
tib %>% dplyr::filter(grp==grp_) %>%
dplyr::mutate(
the_rest = purrr::pmap(
.,
function(grp, N, ...) {
gg <- grp ## there must be a better way
NN <- N
tib %>%
dplyr::filter(
# grp!=grp, ## always false
# N==N ## always true
grp!=gg,
N==NN
) %>%
dplyr::pull(val) %>%
sum()
}
),
no_hugs = purrr::pmap(
.,
function(grp, N, ...) {
tib %>%
dplyr::filter(
grp!={{grp}}, ## ERROR! oh noes!
N=={{N}}
) %>%
dplyr::pull(val) %>%
sum()
}
)
) %>%
tidyr::unnest() %>%
glimpse()
output:
# A tibble: 12 × 3
   grp       N   val
   <chr> <int> <dbl>
 1 a         1     0
 2 a         2     1
 3 a         3     1
 4 a         4     0
 5 b         1     1
 6 b         2     1
 7 b         3     0
 8 b         4     1
 9 c         1     0
10 c         2     1
11 c         3     1
12 c         4     0
Rows: 4
Columns: 3
$ grp <chr> "a", "a", "a", "a"
$ N <int> 1, 2, 3, 4
$ val <dbl> 0, 1, 1, 0
Rows: 4
Columns: 3
$ grp <chr> "a", "a", "a", "a"
$ N <int> 1, 2, 3, 4
$ val <dbl> 0, 1, 1, 0
Error in local_error_context(dots = dots, .index = i, mask = mask) :
promise already under evaluation: recursive default argument reference or earlier problems?
# the_rest should be 1, 2, 1, 1
As often happens, writing the question taught me how to embrace variables with the double curly brace operator {{ }}; see https://dplyr.tidyverse.org/articles/programming.html and the question "Use dynamic name for new column/variable in `dplyr`".
However, it doesn't work inside the pmap.
You need the .env pronoun to look up the object grp in the calling environment rather than in the data (or unquote it with !!):
library(dplyr)
tib %>%
dplyr::filter(grp==.env$grp)
Output:
# A tibble: 4 × 3
  grp       N   val
  <chr> <int> <dbl>
1 a         1     0
2 a         2     1
3 a         3     1
4 a         4     0
The .env pronoun can be used in the same way inside the pmap code:
library(purrr)
tib %>%
  dplyr::filter(grp == .env$grp_) %>%
  dplyr::mutate(
    the_rest = purrr::pmap_dbl(across(everything()), ~ {
      gg <- ..1
      NN <- ..2
      tib %>%
        dplyr::filter(grp != gg, N == NN) %>%
        pull(val) %>%
        sum()
    }),
    # .env$grp here is the global grp ("a"), which matches every row kept by the filter above
    no_hugs = purrr::pmap_dbl(across(all_of(names(tib))), ~ tib %>%
      dplyr::filter(grp != .env$grp, N == ..2) %>%
      pull(val) %>%
      sum())
  )
Output:
# A tibble: 4 × 5
grp N val the_rest no_hugs
<chr> <int> <dbl> <dbl> <dbl>
1 a 1 0 1 1
2 a 2 1 2 2
3 a 3 1 1 1
4 a 4 0 1 1
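If you would rather keep the argument names grp and N in the pmap function, as in the question, the .data and .env pronouns can be combined so that each side of the comparison is unambiguous. This is a sketch under the assumption that .env resolves to the function's own arguments when filter() is called inside that function; the tibble below hard-codes the values printed in the question instead of regenerating them with runif().

```r
library(dplyr)
library(purrr)

# Same values as the example tibble printed in the question
tib <- tibble(
  grp = rep(c("a", "b", "c"), each = 4),
  N   = rep(1:4, times = 3),
  val = c(0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0)
)

res <- tib %>%
  filter(grp == "a") %>%
  mutate(
    the_rest = pmap_dbl(list(grp, N), function(grp, N) {
      tib %>%
        # .data$ is the column, .env$ is the function argument of the same name
        filter(.data$grp != .env$grp, .data$N == .env$N) %>%
        pull(val) %>%
        sum()
    })
  )
res
```

Unlike the no_hugs version above, this compares against the current row's grp rather than the global one, so it should keep working if the outer filter retains more than one group.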

Create parameterized summaries of a column

I have a tibble and I want to create several summaries of the same column, specifically the first, second, and third quartiles.
To do it, I create a named list of functions and that works fine.
library("tidyverse")
set.seed(1234)
df <- tibble(x = rnorm(100))
df %>%
summarise(
across(x,
list(
Q1 = ~ quantile(., 1 / 4),
Q2 = ~ quantile(., 2 / 4),
Q3 = ~ quantile(., 3 / 4)
),
.names = "{.fn}"
)
)
#> # A tibble: 1 × 3
#> Q1 Q2 Q3
#> <dbl> <dbl> <dbl>
#> 1 -0.895 -0.385 0.471
Can I achieve this by specifying the list of probabilities to pass to quantile? So that I save myself typing and more importantly avoid hard-coding the arguments to pass to the aggregating function.
The following doesn't work because it creates one row per probability rather than one column.
df %>%
summarise(
across(x, quantile, 1:3 / 4)
)
#> # A tibble: 3 × 1
#> x
#> <dbl>
#> 1 -0.895
#> 2 -0.385
#> 3 0.471
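You're almost there. (Note that df is regenerated below without resetting the seed, so the values differ from those in the question.)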
df <- tibble(x = rnorm(100))
df %>%
summarise(
across(x,
map(1:3, ~partial(quantile, probs=./4)),
.names = "Q{.fn}"
)
)
# A tibble: 1 x 3
Q1 Q2 Q3
<dbl> <dbl> <dbl>
1 -0.579 0.0815 0.475
If you define the quantiles like this:
Q <- c(0.25, 0.5, 0.75)
Then the following code will produce columns of the appropriate quantiles with sensible labels:
df %>%
summarise(
across(x,
setNames( lapply(Q,
function(x) { f <- ~quantile(., b); f[2][[1]][[3]] <- x; f }),
paste("Q", round(100 * Q), sep = "_")),
.names = "{.fn}"
)
)
#> # A tibble: 1 x 3
#> Q_25 Q_50 Q_75
#> <dbl> <dbl> <dbl>
#> 1 -0.895 -0.385 0.471
Created on 2022-06-29 by the reprex package (v2.0.1)
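Another way to avoid editing the formula by hand is a small function factory that returns one closure per probability. This is only a sketch: quantile_at is a hypothetical helper name, and the five-value df is a toy stand-in for the rnorm() data so the result is easy to check by eye.

```r
library(dplyr)
library(purrr)

# Hypothetical helper: returns a function that computes a single quantile
quantile_at <- function(p) {
  force(p)  # capture the value of p now rather than lazily
  function(x) unname(quantile(x, probs = p))
}

probs <- c(Q1 = 0.25, Q2 = 0.5, Q3 = 0.75)
quantile_fns <- map(probs, quantile_at)  # named list of functions: Q1, Q2, Q3

df <- tibble(x = c(1, 2, 3, 4, 5))
res <- df %>%
  summarise(across(x, quantile_fns, .names = "{.fn}"))
res
```

The column names come from the names of probs, so changing the probabilities or their labels only requires editing that one vector.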

Programmatically count grouped data using logic rules and string

I have a grouped data frame which I want to summarise into "count of values less than x, y, z by group". I can manually generate the wide dataframe I want using code similar to this below
library(tidyverse)
set.seed(1337)
df <- data.frame(cbind(group = seq(1:5), num = sample(x = 1:400, size = 100, replace = T)))
manual <- df %>%
group_by(group) %>%
summarise(less_than_50 = sum(num < 50),
less_than_100 = sum(num < 100),
less_than_150 = sum(num < 150))
However, I'd like to be able to define a list of "less thans" and generate these columns by referring to a list. I've done something similar in the past, though using enframe(quantile()) to generate a long list of quantiles before pivoting
pc <- c(0.1, 0.5, 0.9)
quantiles <- df %>%
group_by(group) %>%
summarise(enframe(quantile(num, pc))) %>%
pivot_wider(
id_cols = group,
names_from = name,
values_from = value
)
But I don't know / understand the way to define a custom function within the enframe(). Ideally I'd like to apply this in something like the code below (though this obviously doesn't work), with or without the pivot step, in order to get back to the same output as "manual"
levels <- c(50, 100, 150)
programmatic <- df %>%
group_by(group) %>%
summarise(cols = ("less_than", x), num < levels) %>%
pivot...
Any help greatly appreciated
One way you could do it:
library(tidyverse)
set.seed(1337)
df <- data.frame(cbind(group = seq(1:5), num = sample(x = 1:400, size = 100, replace = T)))
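less_than <- function(x) {
  df %>%
    group_by(group) %>%
    summarise(less_than_ = sum(num < x)) %>%
    rename_with(~ str_c(., x), .cols = -group)
}
levels <- c(50, 100, 150)
# map_dfr() stacks one block of rows per level, leaving NA in the other levels' columns;
# the grouped mean with na.rm = TRUE collapses them back to one row per group
map_dfr(levels, less_than) |>
  group_by(group) |>
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))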
#> # A tibble: 5 × 4
#> group less_than_50 less_than_100 less_than_150
#> <int> <dbl> <dbl> <dbl>
#> 1 1 4 5 10
#> 2 2 2 2 5
#> 3 3 2 6 11
#> 4 4 4 5 5
#> 5 5 1 7 9
# Manual result for comparison
df %>%
group_by(group) %>%
summarise(less_than_50 = sum(num < 50),
less_than_100 = sum(num < 100),
less_than_150 = sum(num < 150))
#> # A tibble: 5 × 4
#> group less_than_50 less_than_100 less_than_150
#> <int> <int> <int> <int>
#> 1 1 4 5 10
#> 2 2 2 2 5
#> 3 3 2 6 11
#> 4 4 4 5 5
#> 5 5 1 7 9
Created on 2022-06-06 by the reprex package (v2.0.1)
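A variant that avoids the stack-then-average step is to build one counting function per threshold and hand the named list to across(). This is a sketch with a small made-up data frame (four rows, two groups) rather than the sampled data above; less_than_fns is an illustrative name.

```r
library(dplyr)
library(purrr)

levels <- c(50, 100, 150)

# One counting function per threshold, named after the column it should create
less_than_fns <- levels %>%
  set_names(paste0("less_than_", levels)) %>%
  map(function(l) {
    force(l)  # capture the threshold for this closure
    function(v) sum(v < l)
  })

# Toy data, made up for illustration
df <- tibble(group = c(1, 1, 2, 2), num = c(10, 120, 60, 200))

res <- df %>%
  group_by(group) %>%
  summarise(across(num, less_than_fns, .names = "{.fn}"))
res
```

Because each function returns sum() of a logical vector, the counts stay integers instead of being converted to doubles by mean().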

Omitting columns instead of dropping them in purrr

I need to calculate an index for multiple lists. However, I can only do this if I drop some columns (here represented by "w" and "x"). For example:
library(tidyverse)
lists<- list(
l1=tribble(
~w, ~x, ~y, ~z,
#--|--|--|----
12, "a", 2, 1,
12, "a",5, 3,
12, "a",6, 2),
l2=tribble(
~w, ~x, ~y, ~z,
#--|--|--|----
13,"b", 5, 7,
13,"b", 4, 6,
13,"b", 3, 2))
lists %>%
map(~ .x %>%
#group_by(w,x) %>%
select(-w,-x) %>%
mutate(row_sums = rowSums(.)))
Instead of dropping those columns I would like to keep/omit them and calculate the index only for "y" and "z".
I managed to do this by first extracting those columns and binding them again afterward. For example:
select.col<-lists %>%
map_dfr(~ .x %>%
select(w,x))
lists %>%
map_dfr(~ .x %>%
select(-w,-x) %>%
mutate(row_sums = rowSums(.))) %>%
bind_cols(select.col)
However, this is not very elegant, and I had to bind the lists (map_dfr), whereas I would like to keep them as a list.
Another approach would probably be to use select_if(., is.numeric), but as I have some numeric columns that I need to omit, I'm not sure whether this is the best option.
I'm certain there is a simple solution to this problem. Can anyone take a look at it?
Instead of dropping the columns, you can select the columns for which you want to take the sum.
You can select by name
library(dplyr)
library(purrr)
lists %>% map(~ .x %>% mutate(row_sums = rowSums(.[c("y", "z")])))
#$l1
# A tibble: 3 x 5
# w x y z row_sums
# <dbl> <chr> <dbl> <dbl> <dbl>
#1 12 a 2 1 3
#2 12 a 5 3 8
#3 12 a 6 2 8
#$l2
# A tibble: 3 x 5
# w x y z row_sums
# <dbl> <chr> <dbl> <dbl> <dbl>
#1 13 b 5 7 12
#2 13 b 4 6 10
#3 13 b 3 2 5
Or also by position of columns
lists %>% map(~ .x %>% mutate(row_sums = rowSums(.[3:4])))
Here is a tidyverse approach to get the row sums
library(tidyverse)
lists %>%
map(~ .x %>%
mutate(row_sums = select(., y:z) %>%
reduce(`+`)))
#$l1
# A tibble: 3 x 5
# w x y z row_sums
# <dbl> <chr> <dbl> <dbl> <dbl>
#1 12 a 2 1 3
#2 12 a 5 3 8
#3 12 a 6 2 8
#$l2
# A tibble: 3 x 5
# w x y z row_sums
# <dbl> <chr> <dbl> <dbl> <dbl>
#1 13 b 5 7 12
#2 13 b 4 6 10
#3 13 b 3 2 5
Or using base R
lapply(lists, transform, row_sums = y + z)
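The same idea can also be written with across() inside rowSums(), which selects the columns with tidyselect syntax and leaves w and x untouched. This is a sketch that assumes dplyr 1.0.0 or later; the list is rebuilt with tibble() here only to keep the example self-contained.

```r
library(dplyr)
library(purrr)

lists <- list(
  l1 = tibble(w = 12, x = "a", y = c(2, 5, 6), z = c(1, 3, 2)),
  l2 = tibble(w = 13, x = "b", y = c(5, 4, 3), z = c(7, 6, 2))
)

# across(c(y, z)) returns a data frame holding only y and z, which rowSums() accepts
res <- map(lists, ~ mutate(.x, row_sums = rowSums(across(c(y, z)))))
res
```

The result is still a named list of tibbles, each with all four original columns plus row_sums.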

dplyr: passing a grouped tibble to a custom function

(The following scenario simplifies my actual situation)
My data comes from villages, and I would like to summarize an outcome variable by a village variable.
> data
  village     A     Z     Y
  <chr>   <int> <int> <dbl>
1 a           1     1   500
2 a           1     1   400
3 a           1     0   800
4 b           1     0   300
5 b           1     1   700
For example, I would like to calculate the mean of Y only using Z==z by villages. In this case, I want to have (500 + 400)/2 = 450 for village "a" and 700 for village "b".
Please note that the actual situation is more complicated and I cannot directly use this answer, but the point is I need to pass a grouped tibble and a global variable (z) to my function.
z <- 1 # z takes 0 or 1
data %>%
group_by(village) %>% # grouping by village
summarize(Y_village = Y_hat_village(., z)) # pass a part of tibble and a global variable
Y_hat_village <- function(data_village, z){
# This function takes a part of tibble (`data_village`) and a variable `z`
# Calculate the mean for a specific z in a village
data_z <- data_village %>% filter(Z==get("z"))
return(mean(data_z$Y))
}
However, I found that . passes the entire tibble rather than the current group, so the code above returns the same value for all groups.
There are a couple things you can simplify. One is in your function: since you're passing in a value z to the function, you don't need to use get("z"). You have a z in the global environment that you pass in; or, more safely, assign your z value to a variable with some other name so you don't run into scoping issues, and pass that in to the function. In this case, I'm calling it z_val.
library(tidyverse)
z_val <- 1
Y_hat_village2 <- function(data, z) {
data_z <- data %>% filter(Z == z)
return(mean(data_z$Y))
}
You can make the function call on each group using do, which will get you a list-column, and then unnesting that column. Again note that I'm passing in the variable z_val to the argument z.
df %>%
  group_by(village) %>%
  do(y_hat = Y_hat_village2(., z = z_val)) %>%
  unnest(y_hat)
#> # A tibble: 2 x 2
#> village y_hat
#> <chr> <dbl>
#> 1 a 450
#> 2 b 700
However, do() has been superseded, and nesting plus purrr::map is one of the suggested replacements, which I am still having trouble getting the hang of. In this case, you can group and nest, which gives a list-column of data frames called data, then map over that column and again supply z = z_val. When you unnest the y_hat column, you still have the original data as a nested column, since you wanted access to the rest of the columns.
df %>%
group_by(village) %>%
nest() %>%
mutate(y_hat = map(data, ~Y_hat_village2(., z = z_val))) %>%
unnest(y_hat)
#> # A tibble: 2 x 3
#> village data y_hat
#> <chr> <list> <dbl>
#> 1 a <tibble [3 × 3]> 450
#> 2 b <tibble [2 × 3]> 700
Just to check that everything works okay, I also passed in z = 0 to check for (1) scoping issues and (2) that other values of z work.
df %>%
group_by(village) %>%
nest() %>%
mutate(y_hat = map(data, ~Y_hat_village2(., z = 0))) %>%
unnest(y_hat)
#> # A tibble: 2 x 3
#> village data y_hat
#> <chr> <list> <dbl>
#> 1 a <tibble [3 × 3]> 800
#> 2 b <tibble [2 × 3]> 300
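A third way to hand each group's rows to a custom function is dplyr::group_modify(), which calls the function once per group with that group's data as .x. The sketch below assumes dplyr 0.8.1 or later and rebuilds the example data inline so that it runs on its own.

```r
library(dplyr)

df <- tibble(
  village = c("a", "a", "a", "b", "b"),
  A = c(1, 1, 1, 1, 1),
  Z = c(1, 1, 0, 0, 1),
  Y = c(500, 400, 800, 300, 700)
)

Y_hat_village2 <- function(data, z) {
  data_z <- data %>% filter(Z == z)
  mean(data_z$Y)
}

z_val <- 1
res <- df %>%
  group_by(village) %>%
  # .x is the current group's rows (without the grouping column)
  group_modify(~ tibble(y_hat = Y_hat_village2(.x, z = z_val))) %>%
  ungroup()
res
```

The function passed to group_modify() must return a data frame, which is why the scalar mean is wrapped in tibble().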
As an extension/modification to @patL's answer, you can also wrap the tidyverse solution within purrr::map to return a list of two tibbles, one for each z value:
z <- c(0, 1)
map(z, ~df %>% filter(Z == .x) %>% group_by(village) %>% summarise(Y.mean = mean(Y)))
#[[1]]
## A tibble: 2 x 2
# village Y.mean
# <fct> <dbl>
#1 a 800.
#2 b 300.
#
#[[2]]
## A tibble: 2 x 2
# village Y.mean
# <fct> <dbl>
#1 a 450.
#2 b 700.
Sample data
df <- read.table(text =
" village A Z Y
1 a 1 1 500
2 a 1 1 400
3 a 1 0 800
4 b 1 0 300
5 b 1 1 700 ", header = T)
You can use dplyr to accomplish it:
library(dplyr)
df %>%
group_by(village) %>%
filter(Z == 1) %>%
summarise(Y_village = mean(Y))
## A tibble: 2 x 2
# village Y_village
# <chr> <dbl>
#1 a 450
#2 b 700
To get all columns:
df %>%
group_by(village) %>%
filter(Z == 1) %>%
mutate(Y_village = mean(Y)) %>%
distinct(village, A, Z, Y_village)
## A tibble: 2 x 4
## Groups: village [2]
# village A Z Y_village
# <chr> <dbl> <dbl> <dbl>
#1 a 1 1 450
#2 b 1 1 700
data
df <- tibble(village = c("a", "a", "a", "b", "b"),
             A = rep(1, 5),
             Z = c(1, 1, 0, 0, 1),
             Y = c(500, 400, 800, 300, 700))
