Enumerate a grouping variable in a tibble

I would like to know how to use row_number() (or anything else) to transform the variable group into an integer.
tibble_test <- tibble(A = letters[1:10], group = c("A", "A", "A", "B", "B", "C", "C", "C", "C", "D"))
# to get the enumeration inside each group of 'group'
tibble_test %>%
  group_by(group) %>%
  mutate(G1 = row_number())
But I would like to have this output:
# A tibble: 10 x 4
A group G1 G2
<chr> <chr> <dbl> <dbl>
1 a A 1 1
2 b A 2 1
3 c A 3 1
4 d B 1 2
5 e B 2 2
6 f C 1 3
7 g C 2 3
8 h C 3 3
9 i C 4 3
10 j D 1 4
My question is: how do I get the column G2? I know I could transform the 'group' variable into a factor and then an integer (after the tibble is arranged), but I would like to know if it can be done with a counting approach.

You just need one more step: include the group indices with group_indices(). Be aware that how your data is arranged/sorted will affect the index.
library(dplyr)
tibble_test <- tibble(A = letters[1:10], group = c("A", "A", "A", "B", "B", "C", "C", "C", "C", "D"))
# to get the enumeration inside each group of 'group'
tibble_test %>%
  group_by(group) %>%
  mutate(G1 = row_number(),
         G2 = cur_group_id())
# A tibble: 10 x 4
# Groups: group [4]
A group G1 G2
<chr> <chr> <int> <int>
1 a A 1 1
2 b A 2 1
3 c A 3 1
4 d B 1 2
5 e B 2 2
6 f C 1 3
7 g C 2 3
8 h C 3 3
9 i C 4 3
10 j D 1 4
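If you are on dplyr 1.0.0 or later, cur_group_id() is the replacement for a bare group_indices() inside mutate() and should give the same result here:
library(dplyr)
tibble_test %>%
  group_by(group) %>%
  mutate(G1 = row_number(),
         G2 = cur_group_id())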

Related

Reorder one row in tibble - move it to the last row

How do I rearrange the rows in a tibble?
I wish to reorder the rows such that the row with x == "c" goes to the bottom of the tibble, and everything else remains the same.
library(dplyr)
tbl <- tibble(x = c("a", "b", "c", "d", "e", "f", "g", "h"),
              y = 1:8)
An alternative to dplyr::arrange(), using base R. Ordering by the logical x == "c" works because FALSE sorts before TRUE, so the "c" row goes last:
tbl[order(tbl$x == "c"), ] # Thanks to Merijn van Tilborg
Output:
# x y
# <chr> <int>
# 1 a 1
# 2 b 2
# 3 d 4
# 4 e 5
# 5 f 6
# 6 g 7
# 7 h 8
# 8 c 3
The same idea with dplyr::arrange():
tbl |> dplyr::arrange(x == "c")
Using forcats, convert x to a factor with "c" as the last level, then arrange. This doesn't change the class of the column x.
library(forcats)
tbl %>%
  arrange(fct_relevel(x, "c", after = Inf))
# # A tibble: 8 x 2
# x y
# <chr> <int>
# 1 a 1
# 2 b 2
# 3 d 4
# 4 e 5
# 5 f 6
# 6 g 7
# 7 h 8
# 8 c 3
If the order of x is important, it is better to keep it as a factor. The following changes the class of x from character to factor, with "c" as the last level:
tbl %>%
  mutate(x = fct_relevel(x, "c", after = Inf)) %>%
  arrange(x)
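As a quick check (saving the releveled tibble under a hypothetical name tbl2), "c" is now the last factor level:
tbl2 <- tbl %>%
  mutate(x = fct_relevel(x, "c", after = Inf)) %>%
  arrange(x)
levels(tbl2$x)
# [1] "a" "b" "d" "e" "f" "g" "h" "c"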

Calculation of the cumulative points before the event/game

I would like to accumulate the points of several football clubs, per club, for each match day.
I have created a sample dataset to explain the problem:
t <- data.frame(Heim = c("A", "B", "B", "D", "C", "A", "C", "D", "A", "B", "B", "D", "C", "A", "C", "D"),
                Auswärts = c("C", "D", "A", "C", "B", "D", "A", "B", "C", "D", "A", "C", "B", "D", "A", "B"),
                Ergebnis = c("S", "U", "N", "N", "S", "S", "N", "U", "N", "S", "N", "U", "S", "S", "U", "U"),
                Round = c(1, 1, 2, 2, 3, 3, 4, 4, 1, 1, 2, 2, 3, 3, 4, 4),
                Saison = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2))
My idea was to create, for each club (more than 4 in the original data set), a separate column with its points and a cumulative column next to it.
So something like this:
t$A_Points <- ifelse(t$Heim == "A" & t$Ergebnis == "S", 3, 0)
t$A_Points <- ifelse(t$Heim == "A" & t$Ergebnis == "U", 1, t$A_Points)
t$A_Points <- ifelse(t$Auswärts == "A" & t$Ergebnis == "U", 1, t$A_Points)
t$A_Points <- ifelse(t$Auswärts == "A" & t$Ergebnis == "N", 3, t$A_Points)
t$A_Points <- ifelse(t$Auswärts != "A" & t$Heim != "A", NA, t$A_Points)
t$A <- ifelse(is.na(t$A_Points), 0, 1)  # flag matches involving club A
t <- t %>%
  arrange(Saison, Round, A) %>%
  group_by(Saison, A) %>%
  mutate(cumsum = cumsum(A_Points))
Unfortunately, it is very time and space-consuming even for 4 clubs...
Also, I would like the cumulative points to include only matches played before the current matchday, i.e. without the result of the current matchday.
The optimal result for me would be following:
Heim Auswärts Ergebnis Round Saison Points_Heim Points_Auswärts
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 A C S 1 1 0 0
2 B D U 1 1 0 0
3 B A N 2 1 1 3
4 D C N 2 1 1 0
5 A D S 3 1 6 1
6 C B S 3 1 3 1
7 C A N 4 1 6 9
8 D B U 4 1 1 1
9 A C N 1 2 0 0
10 B D S 1 2 0 0
11 B A N 2 2 3 0
12 D C U 2 2 0 3
13 A D S 3 2 3 1
14 C B S 3 2 4 3
15 C A U 4 2 7 3
16 D B U 4 2 1 3
I would be very happy about an idea for an easier solution.
Probably not the shortest solution, but I would do:
t <- t %>%
  group_by(Saison) %>%
  mutate(Heim_Points_Veränderung = case_when(Ergebnis == "S" ~ 3,
                                             Ergebnis == "U" ~ 1,
                                             Ergebnis == "N" ~ 0),
         Auswärts_Points_Veränderung = case_when(Ergebnis == "S" ~ 0,
                                                 Ergebnis == "U" ~ 1,
                                                 Ergebnis == "N" ~ 3),
         Points_Heim = 0,
         Points_Auswärts = 0)

for (i in unique(union(t$Heim, t$Auswärts))) {
  t <- t %>%
    mutate(!!sym(paste0(i, "_points")) := if_else(Heim == i, Heim_Points_Veränderung, 0),
           !!sym(paste0(i, "_points")) := if_else(Auswärts == i, Auswärts_Points_Veränderung, !!sym(paste0(i, "_points"))),
           !!sym(paste0(i, "_cumsum")) := cumsum(lag(!!sym(paste0(i, "_points")), default = 0)),
           Points_Heim = if_else(Heim == i, !!sym(paste0(i, "_cumsum")), Points_Heim),
           Points_Auswärts = if_else(Auswärts == i, !!sym(paste0(i, "_cumsum")), Points_Auswärts))
}

t <- t %>%
  select(Heim, Auswärts, Ergebnis, Round, Saison, Points_Heim, Points_Auswärts)
Output
> t
# A tibble: 16 x 7
# Groups: Saison [2]
Heim Auswärts Ergebnis Round Saison Points_Heim Points_Auswärts
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 A C S 1 1 0 0
2 B D U 1 1 0 0
3 B A N 2 1 1 3
4 D C N 2 1 1 0
5 C B S 3 1 3 1
6 A D S 3 1 6 1
7 C A N 4 1 6 9
8 D B U 4 1 1 1
9 A C N 1 2 0 0
10 B D S 1 2 0 0
11 B A N 2 2 3 0
12 D C U 2 2 0 3
13 C B S 3 2 4 3
14 A D S 3 2 3 1
15 C A U 4 2 7 6
16 D B U 4 2 1 3
This solution should adapt to any number of clubs. Briefly, I store the possible change in points in Heim_Points_Veränderung/Auswärts_Points_Veränderung using case_when() (easier than a lot of nested ifelse calls), so that I can create a column of point changes for each club by running a for loop over all the clubs. That lets me compute a cumulative sum like yours, using lag() so the cumsum is shifted down one row (it then shows the cumulative points before the match rather than after), and I write that value into Points_Heim/Points_Auswärts only for the rows where the club appears in the Heim/Auswärts column. The key to the solution is using !!sym() to feed dynamic variable names to mutate() (note the assignment with :=) inside the for loop.
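The lag()-then-cumsum() step is the core of the "points before the match" logic; a minimal standalone illustration with a made-up points vector:
library(dplyr)
pts <- c(3, 0, 3, 1)           # points a club earns in rounds 1-4 (made-up numbers)
cumsum(lag(pts, default = 0))  # 0 3 3 6: total points *before* each round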

Efficient recursive random sampling with groups of unequal size

This question is a follow-up to my previous question, Efficient recursive random sampling. The solutions in that thread work fine when the groups are of identical size or when a fixed number of samples per group is required. However, let's imagine a dataset as follows:
ID1 ID2
1 A 1
2 A 6
3 B 1
4 B 2
5 B 3
6 C 4
7 C 5
8 C 6
9 D 6
10 D 7
11 D 8
12 D 9
where we want to randomly sample up to n ID2 values for each ID1, and to do so recursively. Recursively here means that we move from the first ID1 to the last ID1, and if an ID2 was already sampled for an earlier ID1, it should not be used for a subsequent ID1. Let's say n = 2; then the expected result would be as follows:
ID1 ID2
1 A 1
2 A 6
4 B 2
5 B 3
6 C 4
7 C 5
11 D 8
12 D 9
For ID1 = "A", there are exactly two potential ID2, so both are selected.
For ID1 = "B", there are two potential ID2 left to select, so both are selected.
For ID1 = "C", there are two potential ID2 left to select, so both are selected.
For ID1 = "D", there are three potential ID2 left to sample from, so two are randomly selected from those.
What can happen beyond the situation shown in the example:
Every ID1 always has a non-zero number of ID2 available; however, it is possible that all of those ID2 were already used. In that case, the ID1 should simply be left out.
It is possible that an ID1 will not have the specified n of ID2 available. In that case, the count closest to the specified n should be retrieved.
ID doesn't have to be seq(ID1).
ID2 could also be a character vector similar to ID1.
Sample df:
df <- structure(list(ID1 = c("A", "A", "B", "B", "B", "C", "C", "C",
                             "D", "D", "D", "D"),
                     ID2 = c(1, 6, 1, 2, 3, 4, 5, 6, 6, 7, 8, 9)),
                class = "data.frame", row.names = c(NA, -12L))
The following function seems to give what you are after. Basically, it loops through each group of ID1 and selects the rows whose ID2 has not already been sampled. Then it keeps the distinct rows (in case some group of ID1 has duplicate ID2 values). The sample size is the minimum of n and the number of rows available for that group.
library(dplyr)

sample <- function(df, n) {
  `%notin%` <- Negate(`%in%`)
  groups <- unique(df$ID1)
  out <- data.frame(ID1 = character(), ID2 = character())
  for (group in groups) {
    # rows of this group whose ID2 has not been taken yet
    options <- df %>%
      filter(ID1 == group,
             ID2 %notin% out$ID2)
    # sample at most n of them (fewer if fewer are available)
    chosen <- sample_n(options,
                       size = min(n, nrow(options))) %>%
      distinct()
    out <- rbind(out, chosen)
  }
  out
}
set.seed(123)
sample(df, 2)
ID1 ID2
1 A 1
2 A 6
3 B 2
4 B 3
5 C 4
6 C 5
7 D 8
8 D 9
Case where a group of ID1 has ID2s that were already used up:
Input:
# A tibble: 10 × 2
ID1 ID2
<chr> <dbl>
1 A 1
2 A 3
3 B 1
4 B 3
5 C 5
6 C 6
7 C 7
8 C 7
9 D 10
10 D 20
Output:
sample(df2, 2)
# A tibble: 6 × 2
ID1 ID2
<chr> <dbl>
1 A 3
2 A 1
3 C 6
4 C 7
5 D 20
6 D 10
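A small caveat with the helper above: because it is named sample, it masks base::sample() for the rest of the session. If that matters, the function can be kept under a different name (sample_recursive below is just a hypothetical label):
sample_recursive <- sample  # copy the helper to a non-masking name
rm(sample)                  # restore base::sample()
set.seed(123)
sample_recursive(df, 2)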
I don't know whether I am oversimplifying the problem. Take a look at the following and see whether it works in your case:
library(tidyverse)
df %>%
  group_split(ID1) %>%
  reduce(~ bind_rows(.x, .y) %>%
           filter(!duplicated(ID2)) %>%
           group_by(ID1) %>%
           slice_sample(n = 2) %>%
           ungroup,
         .init = slice_sample(.[[1]], n = 2))
# A tibble: 8 x 2
ID1 ID2
<chr> <dbl>
1 A 1
2 A 6
3 B 2
4 B 3
5 C 4
6 C 5
7 D 9
8 D 8
Disclaimer: not vectorized, thus inefficient.
Here is a base R option using dynamic programming (DP)
d <- table(df)   # ID1 x ID2 incidence table
nms <- dimnames(d)
res <- list()
for (i in nms$ID1) {
  idx <- which(d[i, ] > 0)   # ID2 still available for this ID1
  if (length(idx) >= 2) {
    j <- sample(idx, 2)
    res[[i]] <- nms$ID2[j]
    d[, j] <- 0              # mark these ID2 as used for all later ID1
  }
}
dfout <- type.convert(
  setNames(rev(stack(res)), names(df)),
  as.is = TRUE
)
which gives
ID1 ID2
1 A 6
2 A 1
3 B 2
4 B 3
5 C 4
6 C 5
7 D 7
8 D 8
For the case where an ID1's ID2 values may already have been used up, e.g.,
> (df <- structure(list(ID1 = c(
+   "A", "A", "B", "B", "B", "C", "C", "C",
+   "D", "D", "D", "D"
+ ), ID2 = c(
+   1, 3, 1, 2, 3, 3, 4, 5, 4, 5, 6, 1
+ )), class = "data.frame", row.names = c(NA, -12L)))
ID1 ID2
1 A 1
2 A 3
3 B 1
4 B 2
5 B 3
6 C 3
7 C 4
8 C 5
9 D 4
10 D 5
11 D 6
12 D 1
we will obtain
ID1 ID2
1 A 1
2 A 3
3 C 5
4 C 4
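For reference, this is what the table d that the loop works on looks like for the df from the question: a 0/1 incidence table of ID1 against ID2, where zeroing a column marks that ID2 as used for every later ID1 (illustration only):
d <- table(df)
d
#    ID2
# ID1 1 2 3 4 5 6 7 8 9
#   A 1 0 0 0 0 1 0 0 0
#   B 1 1 1 0 0 0 0 0 0
#   C 0 0 0 1 1 1 0 0 0
#   D 0 0 0 0 0 1 1 1 1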

Subset a grouped data frame based on range of row position

I have a grouped data frame and I wish to keep, for each group (name), the rows in a given range of positions, e.g. between the 2nd and 3rd position.
df <- data.frame(name = c("a", "a", "a", "b", "b", "c", "c", "c", "c"), x = 1:9)
df
  name x
1    a 1
2    a 2
3    a 3
4    b 4
5    b 5
6    c 6
7    c 7
8    c 8
9    c 9
Here I want an output like this
name x
1 a 2
2 a 3
3 b 5
4 c 7
5 c 8
Thank you,
First, group_by name, then slice from index 2:3:
library(dplyr)
df %>%
  group_by(name) %>%
  slice(2:3)
# A tibble: 5 x 2
# Groups: name [3]
name x
<chr> <int>
1 a 2
2 a 3
3 b 5
4 c 7
5 c 8
The solution that I found is using dplyr::slice(2:3).
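For completeness, a base R equivalent of the grouped slice (a sketch, not taken from the answer above): compute a within-group row number with ave() and keep positions 2 and 3.
df[ave(seq_along(df$name), df$name, FUN = seq_along) %in% 2:3, ]
#   name x
# 2    a 2
# 3    a 3
# 5    b 5
# 7    c 7
# 8    c 8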

Filter by values that have the exact names given in a list (dplyr)

I have the following data.
> dat
# A tibble: 12 x 2
id name
<chr> <chr>
1 1 a
2 1 b
3 1 a
4 2 a
5 2 b
6 2 c
7 2 b
8 3 a
9 3 b
10 3 c
11 3 d
12 3 d
I would like to filter using only the following list:
set <- NULL
set$names <- c("a","b","c")
The ids selected are those that contain exactly the names in the list.
So the result would be only the rows with id 2, selected as follows:
> dat
# A tibble: 12 x 2
id name
<chr> <chr>
4 2 a
5 2 b
6 2 c
7 2 b
Here is the data for easy replication:
dat <- tribble(
  ~id, ~name,
  1, "a",
  1, "b",
  1, "a",
  2, "a",
  2, "b",
  2, "c",
  2, "b",
  3, "a",
  3, "b",
  3, "c",
  3, "d",
  3, "d"
)
Again, I would like to obtain the result shown above.
How about:
group_by(dat, id) %>% filter(setequal(name, set$names))
This filters out all groups where the name column and set$names do not contain the same elements, but allows duplicates.
I am not sure whether this is what you want:
dat %>%
  group_by(id) %>%
  filter(all(set$names %in% name) & all(name %in% set$names))
# A tibble: 4 x 2
id name
<dbl> <chr>
1 2 a
2 2 b
3 2 c
4 2 b
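Side note: all(set$names %in% name) & all(name %in% set$names) checks exactly the same condition as setequal() in the first suggestion, so both answers keep the same groups; a tiny check on the name values within id 2:
x <- c("a", "b", "c", "b")                       # the name values of id 2
setequal(x, set$names)                           # TRUE
all(set$names %in% x) && all(x %in% set$names)   # TRUE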
