Efficient recursive random sampling with groups of unequal size - r

This question is a follow-up to my previous question, "Efficient recursive random sampling". The solutions in that thread work fine when the groups are of identical size or when a fixed number of samples per group is required. However, let's imagine a dataset as follows:
ID1 ID2
1 A 1
2 A 6
3 B 1
4 B 2
5 B 3
6 C 4
7 C 5
8 C 6
9 D 6
10 D 7
11 D 8
12 D 9
where we want to randomly sample up to n ID2 for each ID1, and to do so recursively. Recursively here means that we move from the first ID1 to the last ID1, and if an ID2 was already sampled for an earlier ID1, it should not be used for a subsequent ID1. Let's say n = 2; then the expected results would be as follows:
ID1 ID2
1 A 1
2 A 6
4 B 2
5 B 3
6 C 4
7 C 5
11 D 8
12 D 9
For ID1 = "A", there are exactly two potential ID2, so both are selected.
For ID1 = "B", there are two potential ID2 left to select, so both are selected.
For ID1 = "C", there are two potential ID2 left to select, so both are selected.
For ID1 = "D", there are three potential ID2 left to sample from, so two are randomly selected from those.
What can happen beyond the situation shown in the example:
Every ID1 always has a non-zero number of ID2 available; however, it is possible that all of those ID2 were already used. In that case, that ID1 should simply be left out.
It is possible that none of the ID1 will have the specified n of ID2. In that case, the number closest to the specified n should be retrieved.
ID doesn't have to be seq(ID1).
ID2 could also be a character vector similar to ID1.
Sample df:
df <- structure(list(ID1 = c("A", "A", "B", "B", "B", "C", "C", "C",
"D", "D", "D", "D"), ID2 = c(1, 6, 1, 2, 3, 4, 5, 6, 6, 7, 8,
9)), class = "data.frame", row.names = c(NA, -12L))

The following function seems to give what you are after. It loops through each group of ID1 and keeps the rows whose ID2 has not been sampled yet. It then keeps only the distinct rows (in case some group of ID1 has duplicate ID2 values) and samples from them. The sample size is the minimum of n and the number of remaining rows for that group.
library(dplyr)
# Note: this name masks base::sample() for the rest of the session
sample <- function(df, n) {
  `%notin%` <- Negate(`%in%`)
  groups <- unique(df$ID1)
  out <- data.frame(ID1 = character(), ID2 = character())
  for (group in groups) {
    # Rows of this group whose ID2 is still unused, de-duplicated before sampling
    options <- df %>%
      filter(ID1 == group,
             ID2 %notin% out$ID2) %>%
      distinct()
    chosen <- sample_n(options,
                       size = min(n, nrow(options)))
    out <- rbind(out, chosen)
  }
  out
}
set.seed(123)
sample(df, 2)
ID1 ID2
1 A 1
2 A 6
3 B 2
4 B 3
5 C 4
6 C 5
7 D 8
8 D 9
Case where a group of ID1 has ID2s that were already used up:
Input:
# A tibble: 10 × 2
ID1 ID2
<chr> <dbl>
1 A 1
2 A 3
3 B 1
4 B 3
5 C 5
6 C 6
7 C 7
8 C 7
9 D 10
10 D 20
Output:
sample(df2, 2)
# A tibble: 6 × 2
ID1 ID2
<chr> <dbl>
1 A 3
2 A 1
3 C 6
4 C 7
5 D 20
6 D 10
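For comparison, the same loop can be sketched in base R without dplyr. This is only a minimal sketch under the question's assumptions; the helper name sample_unused and the variables used, keep and pick are hypothetical and do not come from any answer here. It removes duplicate ID2 candidates before sampling and indexes with sample.int(), which sidesteps the base sample() behaviour of treating a single number x as 1:x.

```r
df <- data.frame(
  ID1 = c("A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D", "D"),
  ID2 = c(1, 6, 1, 2, 3, 4, 5, 6, 6, 7, 8, 9)
)

# Hypothetical helper: sample up to n not-yet-used ID2 values per ID1,
# walking through the ID1 groups in order of first appearance.
sample_unused <- function(df, n) {
  used <- df$ID2[0]    # empty vector with the same type as ID2
  keep <- integer(0)   # row indices to return
  for (g in unique(df$ID1)) {
    rows <- which(df$ID1 == g & !(df$ID2 %in% used))
    rows <- rows[!duplicated(df$ID2[rows])]  # one candidate row per ID2
    if (length(rows) == 0) next              # everything used up: skip group
    pick <- rows[sample.int(length(rows), min(n, length(rows)))]
    used <- c(used, df$ID2[pick])
    keep <- c(keep, sort(pick))
  }
  df[keep, , drop = FALSE]
}

set.seed(123)
res <- sample_unused(df, 2)
res
```

Groups A, B and C each have exactly two unused candidates here, so only the two rows chosen for D depend on the seed.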

I don't know whether I am oversimplifying the problem. Take a look at the following and see whether it works in your case:
library(tidyverse)
df %>%
  group_split(ID1) %>%
  reduce(~ bind_rows(.x, .y) %>%
           filter(!duplicated(ID2)) %>%
           group_by(ID1) %>%
           slice_sample(n = 2) %>%
           ungroup(),
         .init = slice_sample(.[[1]], n = 2))
# A tibble: 8 x 2
ID1 ID2
<chr> <dbl>
1 A 1
2 A 6
3 B 2
4 B 3
5 C 4
6 C 5
7 D 9
8 D 8
Disclaimer: not vectorized, thus inefficient.

Here is a base R option using dynamic programming (DP)
d <- table(df)
nms <- dimnames(d)
res <- list()
for (i in nms$ID1) {
  idx <- which(d[i, ] > 0)
  if (length(idx) >= 2) {
    j <- sample(idx, 2)
    res[[i]] <- nms$ID2[j]
    d[, j] <- 0
  }
}
dfout <- type.convert(
  setNames(rev(stack(res)), names(df)),
  as.is = TRUE
)
which gives
ID1 ID2
1 A 6
2 A 1
3 B 2
4 B 3
5 C 4
6 C 5
7 D 7
8 D 8
For a case where some ID2 values are already used up, e.g.,
> (df <- structure(list(ID1 = c(
+ "A", "A", "B", "B", "B", "C", "C", "C",
+ "D", "D", "D", "D"
+ ), ID2 = c(
+ 1, 3, 1, 2, 3, 3, 4, 5, 4, 5, 6, .... [TRUNCATED]
ID1 ID2
1 A 1
2 A 3
3 B 1
4 B 2
5 B 3
6 C 3
7 C 4
8 C 5
9 D 4
10 D 5
11 D 6
12 D 1
we will obtain
ID1 ID2
1 A 1
2 A 3
3 C 5
4 C 4
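Note that the loop above only keeps a group when at least two unused ID2 values remain, which is why B and D disappear from the result. If "up to n" is wanted instead, one possible variation (a sketch, not part of the original answer; the variable n and the use of sample.int() are my assumptions) is to take min(n, length(idx)) values and index through sample.int(), so that a single remaining candidate is not expanded to 1:x by sample():

```r
# Data as printed in the table above
df <- data.frame(
  ID1 = c("A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D", "D"),
  ID2 = c(1, 3, 1, 2, 3, 3, 4, 5, 4, 5, 6, 1)
)
n <- 2  # assumed maximum number of ID2 per ID1

d <- table(df)       # contingency table: rows are ID1, columns are ID2
nms <- dimnames(d)
res <- list()
for (i in nms$ID1) {
  idx <- which(d[i, ] > 0)           # columns (ID2) still available for this ID1
  if (length(idx) > 0) {
    # pick positions within idx, never more than are available
    j <- idx[sample.int(length(idx), min(n, length(idx)))]
    res[[i]] <- nms$ID2[j]
    d[, j] <- 0                      # mark the chosen ID2 as used for all groups
  }
}
dfout <- type.convert(
  setNames(rev(stack(res)), names(df)),
  as.is = TRUE
)
dfout
```

With this data the selection is fully determined apart from row order within a group: A takes 1 and 3, B is left with 2, C takes 4 and 5, and D is left with 6.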

Related

Filter groups based on difference two highest values

I have the following dataframe called df (dput below):
> df
group value
1 A 5
2 A 1
3 A 1
4 A 5
5 B 8
6 B 2
7 B 2
8 B 3
9 C 10
10 C 1
11 C 1
12 C 8
I would like to filter groups based on the difference between their highest value (max) and second-highest value. The difference should be less than or equal to 2 (<= 2). This means that group B should be removed, because its highest value is 8 and its second-highest value is 3, a difference of 5. The desired output should look like this:
group value
1 A 5
2 A 1
3 A 1
4 A 5
5 C 10
6 C 1
7 C 1
8 C 8
So I was wondering if anyone knows how to filter groups based on the difference between their highest and second-highest value?
dput of df:
df<-structure(list(group = c("A", "A", "A", "A", "B", "B", "B", "B",
"C", "C", "C", "C"), value = c(5, 1, 1, 5, 8, 2, 2, 3, 10, 1,
1, 8)), class = "data.frame", row.names = c(NA, -12L))
Using dplyr
library(dplyr)
df %>%
  group_by(group) %>%
  filter(abs(diff(sort(value, decreasing = TRUE)[1:2])) <= 2) %>%
  ungroup()
# A tibble: 8 × 2
group value
<chr> <int>
1 A 5
2 A 1
3 A 1
4 A 5
5 C 10
6 C 1
7 C 1
8 C 8
A base R alternative
grp <- na.omit(aggregate(. ~ group, df, function(x)
abs(diff(sort(x, decreasing=T)[1:2])) <= 2))
do.call(rbind, c(mapply(function(g, v)
list(df[df$group == g & v,]), grp$group, grp$value), make.row.names=F))
group value
1 A 5
2 A 1
3 A 1
4 A 5
5 C 10
6 C 1
7 C 1
8 C 8
One possibility would be to first create a vector of the groups that satisfy your condition and then filter the original data.frame. Here is what I came up with:
library(dplyr)
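group_to_keep <-
  df %>%
  group_by(group) %>%
  # with_ties = FALSE guarantees at most two rows per group, even with tied values
  slice_max(value, n = 2, with_ties = FALSE) %>%
  filter(abs(diff(value)) <= 2) %>%
  pull(group) %>%
  unique()
df %>%
  filter(group %in% group_to_keep)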
You can use ave.
df[ave(df$value, df$group, FUN=\(x) diff(sort(c(-x, Inf)))[1]) <= 2,]
# group value
#1 A 5
#2 A 1
#3 A 1
#4 A 5
#9 C 10
#10 C 1
#11 C 1
#12 C 8
If you can be sure that every group always has at least two values, you can use either of the following:
df[ave(df$value, df$group, FUN=\(x) diff(tail(sort(x), 2))) <= 2,]
df[ave(df$value, df$group, FUN=\(x) diff(sort(-x)[1:2])) <= 2,]
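To make the condition itself visible, the per-group gap between the two largest values can be computed once and inspected before filtering. This is a small sketch; the names gap and keep are illustrative only.

```r
df <- data.frame(
  group = c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"),
  value = c(5, 1, 1, 5, 8, 2, 2, 3, 10, 1, 1, 8)
)

# Difference between the highest and second-highest value in each group
gap <- tapply(df$value, df$group, function(x) {
  top2 <- sort(x, decreasing = TRUE)[1:2]
  top2[1] - top2[2]
})
gap
# A B C
# 0 5 2

# Groups with a single value yield NA and are dropped here
keep <- names(gap)[!is.na(gap) & gap <= 2]
res <- df[df$group %in% keep, ]
res
```

Group A has a gap of 0 (5 and 5), B has 5 (8 and 3) and C has 2 (10 and 8), so only A and C pass the <= 2 threshold.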

Subset a grouped data frame based on range of row position

I have a grouped data frame and I wish to keep, for each group (name), the rows in a given range of positions, for example between the 2nd and 3rd position.
df <- data.frame(name = c("a", "a", "a", "b", "b", "c", "c", "c", "c"), x = 1:9)
df
name x
1 a 1
2 a 2
3 a 3
4 b 4
5 b 5
6 c 6
7 c 7
8 c 8
9 c 9
Here I want an output like this
name x
1 a 2
2 a 3
3 b 5
4 c 7
5 c 8
Thank you,
First, group_by name, then slice from index 2:3:
library(dplyr)
df %>%
  group_by(name) %>%
  slice(2:3)
# A tibble: 5 x 2
# Groups: name [3]
name x
<chr> <int>
1 a 2
2 a 3
3 b 5
4 c 7
5 c 8
The solution that I found is using dplyr::slice(2:3)
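If dplyr is not available, roughly the same result can be sketched in base R by computing each row's position within its group with ave() and then subsetting on that position. The variable names pos and res are illustrative.

```r
df <- data.frame(
  name = c("a", "a", "a", "b", "b", "c", "c", "c", "c"),
  x = 1:9
)

# Position of each row within its 'name' group: 1, 2, 3, 1, 2, 1, 2, 3, 4
pos <- ave(seq_along(df$name), df$name, FUN = seq_along)

# Keep rows whose within-group position lies in the range 2..3
res <- df[pos >= 2 & pos <= 3, ]
res
```

Groups with fewer rows than the upper bound (here b, with two rows) simply contribute the rows they have, as with slice(2:3).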

Enumerate a grouping variable in a tibble

I would like to know how to use row_number or anything else to transform a grouping variable into an integer.
tibble_test <- tibble(A = letters[1:10], group = c("A", "A", "A", "B", "B", "C", "C", "C", "C", "D"))
# to get the enumeration inside each group of 'group'
tibble_test %>%
  group_by(group) %>%
  mutate(G1 = row_number())
But I would like to have this output:
# A tibble: 10 x 4
A group G1 G2
<chr> <chr> <dbl> <dbl>
1 a A 1 1
2 b A 2 1
3 c A 3 1
4 d B 1 2
5 e B 2 2
6 f C 1 3
7 g C 2 3
8 h C 3 3
9 i C 4 3
10 j D 1 4
My question is how to get this column G2. I know I could transform the 'group' variable into a factor and then into an integer (after the tibble is arranged), but I would like to know if it can be done by counting.
You just need one more step: add the group index with cur_group_id(). (In dplyr versions before 1.0.0 the equivalent was group_indices(); calling it without arguments is deprecated in later versions.) Be aware that the index follows the sorted order of the group values, not the order in which they appear in the data.
library(dplyr)
tibble_test <- tibble(A = letters[1:10], group = c("A", "A", "A", "B", "B", "C", "C", "C", "C", "D"))
# to get the enumeration inside each group of 'group'
tibble_test %>%
  group_by(group) %>%
  mutate(G1 = row_number(),
         G2 = cur_group_id())
# A tibble: 10 x 4
# Groups: group [4]
A group G1 G2
<chr> <chr> <int> <int>
1 a A 1 1
2 b A 2 1
3 c A 3 1
4 d B 1 2
5 e B 2 2
6 f C 1 3
7 g C 2 3
8 h C 3 3
9 i C 4 3
10 j D 1 4
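The same two columns can also be sketched in base R. Here match() against unique() numbers the groups in order of first appearance, which is one way to get G2 "by counting" without going through a factor. This is an illustrative alternative, not part of the answer above.

```r
group <- c("A", "A", "A", "B", "B", "C", "C", "C", "C", "D")

# G1: running position of each row within its group
G1 <- ave(seq_along(group), group, FUN = seq_along)

# G2: index of the group, numbered by order of first appearance
G2 <- match(group, unique(group))

out <- data.frame(A = letters[1:10], group, G1, G2)
out
```

The design difference matters when the data is not sorted: match(group, unique(group)) follows the order of appearance, whereas as.integer(factor(group)) follows the sorted order of the group labels.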

How to add a row to data frame based on a condition

I have a data frame to which I want to add rows based on the following condition: column a is equal to C and column b is equal to 3 or 5.
Here is my dataframe
df <- data.frame(a = c("A", "B", "C", "D", "C", "A", "C", "E"),
b = c(seq(8)), stringsAsFactors = TRUE)
Whenever the condition is TRUE, I want to add a row directly below the row where the condition is met, with a set to "add" and b set to 3. I have tried the following:
rbind(df, data.frame(a="add", b = "3"))
# a b
# 1 A 1
# 2 B 2
# 3 C 3
# 4 D 4
# 5 C 5
# 6 A 6
# 7 C 7
# 8 E 8
# 9 add 3
This is not the output I want. The output I want is
# a b
# 1 A 1
# 2 B 2
# 3 C 3
# 4 add 3
# 5 D 4
# 6 C 5
# 7 add 3
# 8 A 6
# 9 C 7
# 10 E 8
How can I do that? I am new to R and thank you for your help.
lens = ifelse(df$b %in% c(3, 5) & df$a == "C", 2, 1)
ind = rep(1:NROW(df), lens)
df2 = df[ind,]
df2$a = as.character(df2$a)
df2$a[cumsum(lens)[which(lens == 2)]] = "add"
df2$b[cumsum(lens)[which(lens == 2)]] = 3
df2
# a b
#1 A 1
#2 B 2
#3 C 3
#3.1 add 3
#4 D 4
#5 C 5
#5.1 add 3
#6 A 6
#7 C 7
#8 E 8
A solution using the tidyverse package.
library(tidyverse)
df2 <- df %>%
  mutate(Group = lag(cumsum(a == "C" & b %in% c(3, 5)), default = FALSE)) %>%
  group_split(Group) %>%
  map_dfr(~ .x %>% bind_rows(tibble(a = "add", b = 3))) %>%
  slice(-n()) %>%
  select(-Group)
df2
# # A tibble: 10 x 2
# a b
# <chr> <dbl>
# 1 A 1
# 2 B 2
# 3 C 3
# 4 add 3
# 5 D 4
# 6 C 5
# 7 add 3
# 8 A 6
# 9 C 7
# 10 E 8
In base R, we can find the positions where a is "C" and b is 3 or 5, repeat those rows in the data frame, and replace the repeated copies with the required values.
pos <- which(df$a == "C" & df$b %in% c(3, 5))
df <- df[sort(c(seq(nrow(df)), pos)), ]
df[seq_along(pos) + pos, ] <- list("add", 3)
row.names(df) <- NULL
df
# a b
#1 A 1
#2 B 2
#3 C 3
#4 add 3
#5 D 4
#6 C 5
#7 add 3
#8 A 6
#9 C 7
#10 E 8
data
df <- data.frame(a = c("A", "B", "C", "D", "C", "A", "C", "E"),
b = c(seq(8)), stringsAsFactors = FALSE)
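The index arithmetic in the base R answer is easier to follow when the intermediate vectors are spelled out. The sketch below repeats the same steps with named intermediates; idx, new_rows and out are illustrative names, not from the answer.

```r
df <- data.frame(a = c("A", "B", "C", "D", "C", "A", "C", "E"),
                 b = seq(8), stringsAsFactors = FALSE)

# Rows that satisfy the condition: 3 and 5
pos <- which(df$a == "C" & df$b %in% c(3, 5))

# Every row once, matching rows twice: 1 2 3 3 4 5 5 6 7 8
idx <- sort(c(seq_len(nrow(df)), pos))

# Each earlier insertion shifts later rows down by one, so the k-th
# duplicate ends up at pos[k] + k: rows 4 and 7
new_rows <- pos + seq_along(pos)

out <- df[idx, ]
out[new_rows, ] <- list("add", 3)   # overwrite the duplicated copies
row.names(out) <- NULL
out
```

Because the offsets come from seq_along(pos), the same arithmetic holds for any number of matching rows, as long as pos is in increasing order, which which() guarantees.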

Filter by values that have the exact names given in a list (dplyr)

I have the following data.
> dat
# A tibble: 12 x 2
id name
<chr> <chr>
1 1 a
2 1 b
3 1 a
4 2 a
5 2 b
6 2 c
7 2 b
8 3 a
9 3 b
10 3 c
11 3 d
12 3 d
I would like to filter only by the following list
set <- NULL
set$names <- c("a","b","c")
The ids selected are those that contain exactly the names in the list.
So the result would be only the 2s selected as follows:
> dat
# A tibble: 4 x 2
id name
<chr> <chr>
4 2 a
5 2 b
6 2 c
7 2 b
Here is the data for easy replication:
dat <- tribble(
~id, ~name,
1, "a",
1, "b",
1, "a",
2, "a",
2, "b",
2, "c",
2, "b",
3, "a",
3, "b",
3, "c",
3, "d",
3, "d"
)
I would like to have the following result.
How about:
group_by(dat, id) %>% filter(setequal(name, set$names))
This filters out all groups where the name column and set$names do not contain the same elements, but allows duplicates.
I am not sure this is what you want:
dat %>%
  group_by(id) %>%
  filter(all(set$names %in% name) & all(name %in% set$names))
# A tibble: 4 x 2
id name
<dbl> <chr>
1 2 a
2 2 b
3 2 c
4 2 b
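Both answers rely on set equality, which ignores duplicates and order. A base R sketch of the same idea, with the per-id result made visible before filtering, could look like this (wanted, ok and ids are illustrative names; dat is rebuilt as a plain data.frame so the snippet is self-contained):

```r
dat <- data.frame(
  id   = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3),
  name = c("a", "b", "a", "a", "b", "c", "b", "a", "b", "c", "d", "d")
)
wanted <- c("a", "b", "c")

# For each id: does its set of names equal the wanted set?
# id 1 lacks "c", id 3 has an extra "d", only id 2 matches exactly.
ok <- tapply(dat$name, dat$id, setequal, wanted)
ok

ids <- names(ok)[ok]
res <- dat[as.character(dat$id) %in% ids, ]
res
```

setequal(x, y) is equivalent to the all(x %in% y) & all(y %in% x) test used in the second answer, so duplicates such as the repeated "b" for id 2 do not prevent a match.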
