Temporarily store a variable in a series of dplyr pipes

Is there a way to pause a series of pipes to store a temporary variable that can be used later on in pipe sequence?
I found this question, but I'm not sure it's doing the same thing I'm looking for.
Here's a sample dataframe:
library(dplyr)
library(tidyr) # for gather()
set.seed(123)
df <- tibble(Grp = c("Apple","Boy","Cat","Dog","Edgar","Apple","Boy","Cat","Dog","Edgar"),
             a = sample(0:9, 10, replace = T),
             b = sample(0:9, 10, replace = T),
             c = sample(0:9, 10, replace = T),
             d = sample(0:9, 10, replace = T),
             e = sample(0:9, 10, replace = T),
             f = sample(0:9, 10, replace = T),
             g = sample(0:9, 10, replace = T))
I am going to convert df to long format, but after doing so I will need to use the number of rows from before the gather.
This is what my desired output looks like. In this case, storing the number of rows before the pipe begins would look like:
n <- nrow(df)
df %>%
  gather(var, value, -Grp) %>%
  mutate(newval = value * n)
# A tibble: 70 x 4
Grp var value newval
<chr> <chr> <int> <int>
1 Apple a 2 20
2 Boy a 7 70
3 Cat a 4 40
4 Dog a 8 80
5 Edgar a 9 90
6 Apple a 0 0
7 Boy a 5 50
8 Cat a 8 80
9 Dog a 5 50
10 Edgar a 4 40
# ... with 60 more rows
In my real-world problem, I have a long chain of pipes, and it would be a lot easier if I could perform this action within the pipe structure. I would like to do something like this:
df %>%
{ "n = nrow(.)" } %>% # temporary variable is created here but df is passed on
gather(var, value, -Grp) %>%
mutate(newval = value * n)
I could do something like the following, but it seems really sloppy.
df %>%
  mutate(n = nrow(.)) %>%
  gather(var, value, -Grp, -n) %>%
  mutate(newval = value * mean(n))
Is there a way to do this or perhaps a good workaround?

You could use a code block for a local variable. This would look like
df %>%
  {
    n <- nrow(.)
    gather(., var, value, -Grp) %>%
      mutate(newval = value * n)
  }
Notice how we have to pass the . to gather explicitly here, and the pipe continues inside the block. But you can add further steps afterwards:
df %>%
  {
    n <- nrow(.)
    gather(., var, value, -Grp) %>%
      mutate(newval = value * n)
  } %>%
  select(newval)
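Another workaround, sketched here with magrittr's tee operator %T>% (which evaluates its right-hand side for side effects and passes the left-hand side on unchanged), is to stash the value with a superassignment:

library(magrittr) # %T>% is not re-exported by dplyr

df %T>%
  { n <<- nrow(.) } %>%   # side effect: store n outside the pipe; df is passed on
  gather(var, value, -Grp) %>%
  mutate(newval = value * n)

The caveat is that <<- writes n into an enclosing (typically the global) environment, so this trades the local scoping of the code-block approach for a pipe that reads straight top to bottom.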

Here is an option with %>>% (the pipe operator from pipeR):
library(pipeR)
library(dplyr)
library(tidyr)
df %>>%
  (~ n = nrow(.)) %>%   # pipeR side effect: assign n, pass df on unchanged
  gather(., var, value, -Grp) %>%
  mutate(newval = value * n)
# A tibble: 70 x 4
# Grp var value newval
# <chr> <chr> <int> <int>
# 1 Apple a 2 20
# 2 Boy a 7 70
# 3 Cat a 4 40
# 4 Dog a 8 80
# 5 Edgar a 9 90
# 6 Apple a 0 0
# 7 Boy a 5 50
# 8 Cat a 8 80
# 9 Dog a 5 50
#10 Edgar a 4 40
# … with 60 more rows

lag() with group_by between current and last observation in R

Edit: I found the solution with na.locf().
data <-
  data %>%
  group_by(country) %>%
  arrange(wave) %>%
  mutate(weight.io = na.locf(weight)) %>%
  mutate(lag_weight = weight - lag(weight.io))
I have a dataset below.
set.seed(42000)
data <- data_frame(
  country = sample(letters[1:20], size = 100, replace = TRUE),
  weight = round(runif(100, min = 48, max = 90)))
data <- data %>%
  group_by(country) %>%
  arrange(weight) %>%
  mutate(wave = seq_along(weight))
n_rows <- nrow(data)
perc_missing <- 10
data[sample(1:n_rows, round(perc_missing/100 * n_rows, 0)), "weight"] <- NA
I would like to obtain the difference between one country's current "weight" and the last observed "weight" for each wave.
For country "a" wave 5, I want the value to be 69 - 65 (last observed weight at wave < 5).
And for wave 8, 82(weight at wave 8) - 69(weight at wave 5).
My approach was the one below, but it didn't work.
data <-
  data %>%
  group_by(country) %>%
  arrange(wave) %>%
  mutate(lag_weight = weight - lag(weight, default = first(weight, na.rm = TRUE)))
Thank you!
I think this is a combination of diff (instead of lag, though that could work just as well) and, more importantly, tidyr::fill (or zoo::na.locf, sketched at the end of this answer):
BTW, na.rm= is not an argument of first, so I've removed it.
library(dplyr)
# library(tidyr) # fill
data %>%
  group_by(country) %>%
  tidyr::fill(weight) %>%
  filter(country == "a") %>%
  mutate(lag_weight = weight - lag(weight, default = first(weight)))
# # A tibble: 10 x 4
# # Groups: country [1]
# country weight wave lag_weight
# <chr> <dbl> <int> <dbl>
# 1 a 54 1 0
# 2 a 55 2 1
# 3 a 65 3 10
# 4 a 65 4 0
# 5 a 69 5 4
# 6 a 69 6 0
# 7 a 69 7 0
# 8 a 82 8 13
# 9 a 82 9 0
# 10 a 85 10 3
The issue here is that weight is overwritten with the LOCF (last observation carried forward) value instead of preserving the NA values. If that's important, you can make another weight variable for temporary use (and remove it afterwards):
data %>%
  mutate(tmpweight = weight) %>%
  group_by(country) %>%
  tidyr::fill(tmpweight) %>%
  filter(country == "a") %>%
  mutate(lag_weight = tmpweight - lag(tmpweight, default = first(tmpweight))) %>%
  select(-tmpweight)
# # A tibble: 10 x 4
# # Groups: country [1]
# country weight wave lag_weight
# <chr> <dbl> <int> <dbl>
# 1 a 54 1 0
# 2 a 55 2 1
# 3 a 65 3 10
# 4 a NA 4 0
# 5 a 69 5 4
# 6 a NA 6 0
# 7 a NA 7 0
# 8 a 82 8 13
# 9 a 82 9 0
# 10 a 85 10 3
FYI, you can use c(0, diff(weight)) instead of weight - lag(weight) for the same effect. Since diff returns a vector one element shorter (it is the gap between consecutive values), we prepend a 0 here:
data %>%
  group_by(country) %>%
  tidyr::fill(weight) %>%
  filter(country == "a") %>%
  mutate(lag_weight = c(0, diff(weight)))
(The filter(country == "a") is purely for demonstration to match your example, not that it is required for this solution.)
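For completeness, here is a sketch of the zoo::na.locf variant mentioned at the top (assuming zoo is installed); na.rm = FALSE keeps any leading NAs so that the column length is preserved inside mutate:

library(zoo)

data %>%
  group_by(country) %>%
  mutate(tmpweight = na.locf(weight, na.rm = FALSE)) %>%   # LOCF within country
  filter(country == "a") %>%
  mutate(lag_weight = tmpweight - lag(tmpweight, default = first(tmpweight))) %>%
  select(-tmpweight)

If a country's first weight is missing, its first lag_weight values stay NA, analogous to what fill would produce.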

How to randomly filter rows to achieve desired proportions of a grouping variable

I have data with a group variable, and I want to sample rows to end up with certain proportions in the group variable. This might require filtering rows, as the following example shows.
Simulating data
set.seed(2021)
my_df <-
  data.frame(animal = sample(rep(c("dog", "cat", "rabbit"), times = c(150, 4100, 220))),
             weight = sample(5:25, size = 4470, replace = TRUE))
> head(my_df)
## animal weight
## 1 cat 11
## 2 cat 24
## 3 cat 9
## 4 cat 20
## 5 cat 11
## 6 rabbit 9
Here we have data about 4470 animals, which could be either cat, dog, or rabbit, and the weight of each individual animal.
If we summarize the proportions of animals of each type we get:
library(dplyr)
my_df %>%
  group_by(animal) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n))
## # A tibble: 3 x 3
## animal n freq
## * <chr> <int> <dbl>
## 1 cat 4100 0.917
## 2 dog 150 0.0336
## 3 rabbit 220 0.0492
We thus learned that in my_df, 91.7% of the data are cats, 4.92% are rabbits, and 3.36% are dogs.
Desired output: sampling rows from the data to end up with different proportions in the animal column
I realized that my data in my_df is not representative of the population I study, and therefore I want to sample rows to alter the proportions.
I want to end up with data that is comprised of 70% cats, 15% dogs, and 15% rabbits. Obviously, I'll need to throw away many of the cat rows to reach such a distribution.
Is there a simple way to reach such random sampling, to meet a desired proportion over a grouping variable?
EDIT
To clarify, in my_df reaching the desired proportions between cat:dog:rabbit requires throwing away not only cats, but potentially dogs and rabbits too.
EDIT 2
In the comments, @Limey suggested this post, which is indeed relevant. However, I've tried applying the solution from there and it didn't give the expected output.
library(purrr)
group_slice_prop <- c(cat = 0.7, dog = 0.15, rabbit = 0.15)
output <-
  my_df %>%
  split(.$animal) %>%
  imap_dfr(~ slice_sample(.x, prop = group_slice_prop[.y]))
We get that output keeps 70% of the cats, 15% of the dogs, and 15% of the rabbits, i.e. the proportions are applied within each group rather than to the result as a whole:
output %>%
  group_by(animal) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n))
## # A tibble: 3 x 3
## animal n freq
## * <chr> <int> <dbl>
## 1 cat 2870 0.981
## 2 dog 22 0.00752
## 3 rabbit 33 0.0113
But I was expecting to summarize output and get:
# A tibble: 3 x 3
animal n freq
* <chr> <int> <dbl>
1 cat ? 0.70
2 dog ? 0.15
3 rabbit ? 0.15
EDIT 3
Both @AnilGoyal and @Chris Ruehlemann proposed solutions that work in this case, but are somewhat tailored to the toy data I provided. We could think of other scenarios with different, less intuitive proportions, or with more levels in the grouping variable, that would require different math to figure out the n per group. I want to avoid that: I wish to specify the desired blend of proportions in the group variable and let the code decide how many rows to throw away from each group category to reach that blend.
Edited answer in view of EDIT 3:
# desired sample proportions
samp <- tibble(animal = c('cat', 'dog', 'rabbit'),
               prop = c(0.70, 0.15, 0.15))

arrange(count(my_df, animal), n) %>%
  left_join(samp, by = "animal") %>%
  mutate(n1 = first(n) / first(prop),
         n = prop * n1) %>%
  select(-prop, -n1) %>%
  right_join(my_df, by = "animal") %>%
  group_split(animal) %>%
  map_df(~ sample_n(.x, size = first(n))) %>%
  select(-n)
# A tibble: 1,000 x 2
animal weight
<chr> <int>
1 cat 19
2 cat 7
3 cat 17
4 cat 11
5 cat 22
6 cat 8
7 cat 22
8 cat 14
9 cat 22
10 cat 18
# ... with 990 more rows
Try this out on a different df:
set.seed(123)
my_df <-
  data.frame(animal = sample(rep(c("dog", "cat", "rabbit"), times = c(1500, 4100, 220))),
             weight = sample(5:25, size = 5820, replace = TRUE))

library(tidyverse)

samp <- tibble(animal = c('cat', 'dog', 'rabbit'),
               prop = c(0.70, 0.15, 0.15))

arrange(count(my_df, animal), n) %>%
  left_join(samp, by = "animal") %>%
  mutate(n1 = first(n) / first(prop),
         n = prop * n1) %>%
  select(-prop, -n1) %>%
  right_join(my_df, by = "animal") %>%
  group_split(animal) %>%
  map_df(~ sample_n(.x, size = first(n))) %>%
  select(-n) -> sampled
library(janitor)
tabyl(sampled$animal)
 sampled$animal    n   percent
            cat 1026 0.6998636
            dog  220 0.1500682
         rabbit  220 0.1500682
You can create a tibble with animal name and target percentage that you need in the final dataframe and sample rows for each animal based on it.
library(dplyr)
tibble(animal = c('cat', 'dog', 'rabbit'),
       prop = c(0.70, 0.15, 0.15),
       n = nrow(my_df) * prop) %>%
  left_join(my_df, by = 'animal') %>%
  group_by(animal) %>%
  sample_n(size = first(n), replace = TRUE) %>%
  ungroup() %>%
  select(-prop, -n) -> result
Checking the proportion :
result %>% count(animal) %>% mutate(n = prop.table(n))
# animal n
# <chr> <dbl>
#1 cat 0.700
#2 dog 0.150
#3 rabbit 0.150
Here's a step-wise solution based on the fact, noted in a comment, that there are 150 dogs and 220 rabbits, and that dogs and rabbits are each supposed to account for 15%:
cats <- my_df %>%
  filter(animal == "cat") %>%
  sample_n(700)

rabbits <- my_df %>%
  filter(animal == "rabbit") %>%
  sample_n(150)

dogs <- my_df %>%
  filter(animal == "dog")

my_newdf <- rbind(cats, rabbits, dogs)
Check:
my_newdf %>%
  group_by(animal) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n))
# A tibble: 3 x 3
animal n freq
* <chr> <int> <dbl>
1 cat 700 0.7
2 dog 150 0.15
3 rabbit 150 0.15
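To make the sizing fully general (per EDIT 3), here is a sketch that derives every group's n from the desired blend, assuming dplyr >= 1.0; target, counts, total, and sizes are hypothetical names. The largest achievable total is bounded by the most constrained group, i.e. min(count / prop), and each group's size follows from that total:

library(dplyr)

target <- c(cat = 0.70, dog = 0.15, rabbit = 0.15)     # desired blend

counts <- table(my_df$animal)
total  <- floor(min(counts[names(target)] / target))   # binding constraint on the total
sizes  <- round(target * total)                        # rows to keep per group

my_df %>%
  group_by(animal) %>%
  group_modify(~ slice_sample(.x, n = sizes[[.y$animal]])) %>%
  ungroup()

For the toy data, the constraint is the dogs (150 / 0.15 = 1000 rows total), reproducing the 700/150/150 split above.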

How to use fct_lump() to get the top n levels by group and put the rest in 'other'?

I'm trying to find the top 3 factor levels within each group, based on an aggregating variable, and group the remaining factor levels into "other" for each group. Normally I'd use fct_lump_n for this, but I can't figure out how to make it work within each group.
Here's an example, where I want to form groups based on the x variable, order the y variables based on the value of z, choose the first 3 y variables, and group the rest of y into "other":
set.seed(50)
df <- tibble(x = factor(sample(letters[18:20], 100, replace = T)),
             y = factor(sample(letters[1:10], 100, replace = T)),
             z = sample(100, 100, replace = T))
I've tried doing this:
df %>%
  group_by(x) %>%
  arrange(desc(z), .by_group = T) %>%
  slice_head(n = 3)
which returns this:
# A tibble: 9 x 3
# Groups: x [3]
x y z
<fct> <fct> <int>
1 r i 95
2 r c 92
3 r a 88
4 s g 94
5 s g 92
6 s f 92
7 t j 100
8 t d 93
9 t i 81
This is basically what I want, but I'm missing the 'other' variable within each of r, s, and t, which collects the values of z which have not been counted.
Can I use fct_lump_n for this? Or slice_head combined with grouping the excluded variables into "other"?
Tried in R 4.0.0 and tidyverse 1.3.0:
set.seed(50)
df <- tibble(x = factor(sample(letters[18:20], 100, replace = T)),
             y = factor(sample(letters[1:10], 100, replace = T)),
             z = sample(100, 100, replace = T))
df %>%
  group_by(x) %>%
  arrange(desc(z)) %>%
  mutate(a = row_number(-z)) %>%
  mutate(y = case_when(a > 3 ~ "Other", TRUE ~ as.character(y))) %>%
  mutate(a = case_when(a > 3 ~ "Other", TRUE ~ as.character(a))) %>%
  group_by(x, y, a) %>%
  summarize(z = sum(z)) %>%
  arrange(x, a) %>%
  select(-a)
Output:
# A tibble: 12 x 3
# Groups: x, y [11]
x y z
<fct> <chr> <int>
1 r b 92
2 r j 89
3 r g 83
4 r Other 749
5 s i 93
6 s h 93
7 s i 84
8 s Other 1583
9 t a 99
10 t b 98
11 t i 95
12 t Other 1508
Note: the variable a is used together with y to compensate for the fact that y is sampled with replacement (see rows 5 and 7 of the output); without a, rows 5 and 7 would have their z summed up. Also note that I tried to solve the problem as posed, but I left y as character, since I suppose those "Other"s are not meant to be one and the same factor level.
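On the fct_lump_n part of the question: it does accept a weight argument w (forcats >= 0.5.0, dplyr >= 1.0 assumed here), so if lumping the top 3 levels by total z within each group is acceptable, a sketch could look like this:

library(dplyr)
library(forcats)

df %>%
  group_by(x) %>%
  mutate(y = fct_lump_n(y, n = 3, w = z, other_level = "Other")) %>%  # lump per group, weighted by z
  group_by(x, y) %>%
  summarise(z = sum(z), .groups = "drop")

Note this keeps the 3 levels with the largest summed z per group, which differs from the slice_head(n = 3) logic above whenever a level contributes several rows.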

map over columns and apply custom function

I'm missing something small here and struggling to pass columns to a function. I just want to map (or lapply) over columns and perform a custom function on each of them. Minimal example here:
library(tidyverse)
set.seed(10)
df <- data.frame(id = c(1,1,1,2,3,3,3,3),
                 r_r1 = sample(c(0,1), 8, replace = T),
                 r_r2 = sample(c(0,1), 8, replace = T),
                 r_r3 = sample(c(0,1), 8, replace = T))
df
# id r_r1 r_r2 r_r3
# 1 1 0 0 1
# 2 1 0 0 1
# 3 1 1 0 1
# 4 2 1 1 0
# 5 3 1 0 0
# 6 3 0 0 1
# 7 3 1 1 1
# 8 3 1 0 0
A function just to filter and count the unique ids remaining in the dataset:
cnt_un <- function(var) {
  df %>%
    filter({{var}} == 1) %>%
    group_by({{var}}) %>%
    summarise(n_uniq = n_distinct(id)) %>%
    ungroup()
}
It works outside of map:
cnt_un(r_r1)
# A tibble: 1 x 2
r_r1 n_uniq
<dbl> <int>
1 1 3
I want to apply the function over all r_r columns to get something like:
df2
# y n_uniq
# 1 r_r1 3
# 2 r_r2 2
# 3 r_r3 2
I thought the following would work, but it doesn't:
map(dplyr::select(df, matches("r_r")), ~ cnt_un(.x))
Any suggestions? Thanks!
I'm not sure if there's a direct tidyeval way to do this with something like map. The issue you're running into is that in calling map(df, *whatever_function*), the function is being called on each column of df as a vector, whereas your function expects a bare column name in the tidyeval style. To verify that:
map(df, class)
will return "numeric" for each column.
An alternative is to iterate over column names as strings, and convert those to symbols; this takes just one additional line in the function.
library(dplyr)
library(tidyr)
library(purrr)
cnt_un_name <- function(varname) {
  var <- ensym(varname)
  df %>%
    filter({{var}} == 1) %>%
    group_by({{var}}) %>%
    summarise(n_uniq = n_distinct(id)) %>%
    ungroup()
}
Calling the function is a little awkward because each call keeps only the relevant column name (calling it on "r_r1" gets columns "r_r1" and "n_uniq", etc.). One way is to get the vector of column names you want, name it so you can add an ID column in map_dfr, and drop the extra columns, since they'll be mostly NA.
grep("^r_r\\d+", names(df), value = TRUE) %>%
  set_names() %>%
  map_dfr(cnt_un_name, .id = "y") %>%
  select(y, n_uniq)
#> # A tibble: 3 x 2
#> y n_uniq
#> <chr> <int>
#> 1 r_r1 3
#> 2 r_r2 2
#> 3 r_r3 2
A better way is to call the function, then bind after reshaping.
grep("^r_r\\d+", names(df), value = TRUE) %>%
  map(cnt_un_name) %>%
  map_dfr(pivot_longer, 1, names_to = "y") %>%
  select(y, n_uniq)
# same output as above
Alternatively (and maybe better/more scaleable) would be to do the column renaming inside the function definition.
Here's a base R solution that uses lapply. The tricky bit is that your function isn't actually running on single columns; it's using id, too, so you can't use canned functions that iterate column-wise.
do.call(rbind, lapply(grep("r_r", colnames(df), value = TRUE), function(i) {
  X <- subset(df, df[, i] == 1)
  data.frame(y = i, n_uniq = length(unique(X$id)), stringsAsFactors = FALSE)
}))
y n_uniq
1 r_r1 2
2 r_r2 3
3 r_r3 2
Here is another solution. I changed the syntax of your function. Now you supply the pattern of the columns you want to select.
cnt_un <- function(var_pattern) {
  df %>%
    pivot_longer(cols = contains(var_pattern), values_to = "vals", names_to = "y") %>%
    filter(vals == 1) %>%
    group_by(y) %>%
    summarise(n_uniq = n_distinct(id)) %>%
    ungroup()
}
cnt_un("r_r")
#> # A tibble: 3 x 2
#> y n_uniq
#> <chr> <int>
#> 1 r_r1 2
#> 2 r_r2 3
#> 3 r_r3 2
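As a footnote, here is a sketch of an across()-based alternative (assuming dplyr >= 1.0) that avoids the tidyeval machinery entirely, since id can be subset directly inside summarise:

library(dplyr)
library(tidyr)

df %>%
  summarise(across(starts_with("r_r"), ~ n_distinct(id[.x == 1]))) %>%  # count ids where the column is 1
  pivot_longer(everything(), names_to = "y", values_to = "n_uniq")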

R Filter to remove rows within map function

I am simulating events from the following data table using the map function and filtering out zero-value events.
However I would like to filter within the map function, thereby reducing the size of the event table that gets created.
The following simulates events based on the Poisson distribution for a given mean (it includes freq = 0 rows, but to manage memory I don't want these):
library(tidyverse)
set.seed(1); n <- 10
data <- tibble(locid = seq(5), exp = 2)
event <- data %>%
  mutate(freq = map(exp, ~ rpois(n, .x))) %>%
  mutate(freq = map(freq, ~ data.frame(freq = .x, sim = seq_along(.x)))) %>%
  unnest()
I can then filter with event %>% filter(freq != 0). How can I slot this into the map function please? This will make the memory footprint a lot more manageable for my code. Thank you!
An option would be discard:
library(tidyverse)
data %>%
  mutate(freq = map(exp, ~ rpois(n, .x) %>%
                      discard(. == 0) %>%
                      tibble(freq = ., sim = seq_along(.)))) %>%
  unnest()
If 'sim' should be based on the original sequence, then create a tibble of the 'rpois' output and the sequence of its elements, and do the filter within map:
data %>%
  mutate(freq = map(exp, ~ rpois(n, .x) %>%
                      tibble(freq = ., sim = seq_along(.)) %>%
                      filter(freq != 0))) %>%
  unnest()
Or using mutate in between
data %>%
  mutate(freq = map(exp, ~ tibble(freq = rpois(n, .x)) %>%
                      mutate(sim = row_number()) %>%
                      filter(freq != 0))) %>%
  unnest()
Here is one idea. There is no need to create a data.frame: create list-columns with freq and sim, and then unnest them.
library(tidyverse)
set.seed(1); n <- 10
data <- tibble(locid = seq(5), exp = 2)
event <- data %>%
  mutate(freq = map(exp, ~ rpois(n, .x)),
         sim = map(freq, ~ which(.x > 0)),
         freq = map(freq, ~ .x[.x > 0])) %>%
  unnest()
event
# # A tibble: 45 x 4
# locid exp freq sim
# <int> <dbl> <int> <int>
# 1 1 2 1 1
# 2 1 2 1 2
# 3 1 2 2 3
# 4 1 2 4 4
# 5 1 2 1 5
# 6 1 2 4 6
# 7 1 2 4 7
# 8 1 2 2 8
# 9 1 2 2 9
# 10 2 2 1 1
# # ... with 35 more rows
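For what it's worth, here is a sketch of the same idea with current tidyr (>= 1.0.0), where unnest() wants its columns spelled out:

library(tidyverse)

set.seed(1); n <- 10
data <- tibble(locid = seq(5), exp = 2)

event <- data %>%
  mutate(freq = map(exp, ~ {
    x <- rpois(n, .x)
    tibble(freq = x, sim = seq_along(x)) %>%
      filter(freq != 0)                    # drop zero-count events before unnesting
  })) %>%
  unnest(cols = freq)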
