Matching controls to cases using multiple conditions in R

I want to match 2 controls to every case using two conditions:
the age difference should be within ±2;
the income difference should be within ±2.
If there are more than 2 matching controls for a case, I just need to select 2 of them at random.
Here is an example:
EXAMPLE
DATA
dat = structure(list(id = c(1, 2, 3, 4, 111, 222, 333, 444, 555, 666,
777, 888, 999, 1000),
age = c(10, 20, 44, 11, 12, 11, 8, 12, 11, 22, 21, 18, 21, 18),
income = c(35, 72, 11, 35, 37, 36, 33, 70, 34, 74, 70, 44, 76, 70),
group = c("case", "case", "case", "case", "control", "control",
"control", "control", "control", "control", "control",
"control", "control", "control")),
row.names = c(NA, -14L), class = c("tbl_df", "tbl", "data.frame"))
> dat
# A tibble: 14 x 4
id age income group
<dbl> <dbl> <dbl> <chr>
1 1 10 35 case
2 2 20 72 case
3 3 44 11 case
4 4 11 35 case
5 111 12 37 control
6 222 11 36 control
7 333 8 33 control
8 444 12 70 control
9 555 11 34 control
10 666 22 74 control
11 777 21 70 control
12 888 18 44 control
13 999 21 76 control
14 1000 18 70 control
EXPECTED OUTCOME
For id = 1, the matched controls are shown below, and I just need to select 2 of them at random from this table.
id    age  income  group
111   12   37      control
222   11   36      control
333   8    33      control
555   11   34      control
For id = 2, the matched controls are shown below, and I just need to select 2 of them at random from this table.
id    age  income  group
666   22   74      control
777   21   70      control
1000  18   70      control
For id = 3, there are no matched controls in dat.
For id = 4, the matched controls are shown below, and I just need to select 2 of them at random from this table.
One thing to note here is that the candidate controls for id = 1 and id = 4 overlap. I don't want two cases to share a control: if id = 1 chooses id = 111 and id = 222 as controls, then id = 4 can only choose id = 555 as a control; and if id = 1 chooses id = 111 and id = 333 as controls, then id = 4 can choose id = 222 and id = 555 as controls.
id    age  income  group
111   12   37      control
222   11   36      control
555   11   34      control
The final output may look like this (the ids in the control group are randomly selected from those that meet the conditions):
id    age  income  group
1     10   35      case
2     20   72      case
3     44   11      case
4     11   35      case
111   12   37      control
222   11   36      control
333   8    33      control
555   11   34      control
777   21   70      control
1000  18   70      control
NOTE
I've looked up some websites, but they don't meet my needs. I don't know how to implement my requirements using R code.
Any help will be highly appreciated!
References:
1. https://stackoverflow.com/questions/56026700/is-there-any-package-for-case-control-matching-individual-1n-matching-in-r-n
2. Case control matching in R (or SPSS), based on age, sex and ethnicity?
3. Matching case-controls in R using the ccoptimalmatch package
4. Exact Matching in R

As per the modified requirement, I propose the following for loop:
library(dplyr, warn.conflicts = FALSE)

dat %>%
  split(.$group) %>%
  list2env(envir = .GlobalEnv)
#> <environment: R_GlobalEnv>

control$FILTER <- FALSE
control
#> # A tibble: 10 x 5
#> id age income group FILTER
#> <dbl> <dbl> <dbl> <chr> <lgl>
#> 1 111 12 37 control FALSE
#> 2 222 11 36 control FALSE
#> 3 333 8 33 control FALSE
#> 4 444 12 70 control FALSE
#> 5 555 11 34 control FALSE
#> 6 666 22 74 control FALSE
#> 7 777 21 70 control FALSE
#> 8 888 18 44 control FALSE
#> 9 999 21 76 control FALSE
#> 10 1000 18 70 control FALSE
set.seed(123)
for (i in seq_len(nrow(case))) {
  # candidate controls within +/- 2 of the case's age and income that are not used yet
  x <- which(between(control$age, case$age[i] - 2, case$age[i] + 2) &
               between(control$income, case$income[i] - 2, case$income[i] + 2) &
               !control$FILTER)
  # indexing via sample.int avoids sample()'s surprise when only one candidate remains
  control$FILTER[x[sample.int(length(x), min(2, length(x)))]] <- TRUE
}
control
#> # A tibble: 10 x 5
#> id age income group FILTER
#> <dbl> <dbl> <dbl> <chr> <lgl>
#> 1 111 12 37 control TRUE
#> 2 222 11 36 control TRUE
#> 3 333 8 33 control TRUE
#> 4 444 12 70 control FALSE
#> 5 555 11 34 control TRUE
#> 6 666 22 74 control FALSE
#> 7 777 21 70 control TRUE
#> 8 888 18 44 control FALSE
#> 9 999 21 76 control FALSE
#> 10 1000 18 70 control TRUE
bind_rows(case, control) %>% filter(FILTER | is.na(FILTER)) %>% select(-FILTER)
#> # A tibble: 10 x 4
#> id age income group
#> <dbl> <dbl> <dbl> <chr>
#> 1 1 10 35 case
#> 2 2 20 72 case
#> 3 3 44 11 case
#> 4 4 11 35 case
#> 5 111 12 37 control
#> 6 222 11 36 control
#> 7 333 8 33 control
#> 8 555 11 34 control
#> 9 777 21 70 control
#> 10 1000 18 70 control
Check the results with a different seed:
control$FILTER <- FALSE  # reset the flags before re-running with a new seed
set.seed(234)
for (i in seq_len(nrow(case))) {
  x <- which(between(control$age, case$age[i] - 2, case$age[i] + 2) &
               between(control$income, case$income[i] - 2, case$income[i] + 2) &
               !control$FILTER)
  control$FILTER[x[sample.int(length(x), min(2, length(x)))]] <- TRUE
}
control
bind_rows(case, control) %>% filter(FILTER | is.na(FILTER)) %>% select(-FILTER)
# A tibble: 10 x 4
id age income group
<dbl> <dbl> <dbl> <chr>
1 1 10 35 case
2 2 20 72 case
3 3 44 11 case
4 4 11 35 case
5 111 12 37 control
6 222 11 36 control
7 333 8 33 control
8 555 11 34 control
9 777 21 70 control
10 1000 18 70 control
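As an aside, the same no-shared-controls constraint can also be expressed by keeping a shrinking pool of candidate controls and removing every control that gets used. A rough sketch only (pool and matched are illustrative names; it assumes the dat defined above):
library(dplyr)

set.seed(1)
pool <- filter(dat, group == "control")       # controls still available
matched <- list()
for (case_id in filter(dat, group == "case")$id) {
  cs <- filter(dat, id == case_id)
  hits <- filter(pool,
                 between(age, cs$age - 2, cs$age + 2),
                 between(income, cs$income - 2, cs$income + 2))
  picked <- slice_sample(hits, n = min(2, nrow(hits)))
  matched[[as.character(case_id)]] <- picked
  pool <- anti_join(pool, picked, by = "id")  # a used control cannot be reused
}
bind_rows(matched)                            # at most 2 controls per case, none shared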
dat was modified before proceeding (only case ids 1 to 3 are included below). The idea is:
split the data into the two groups, case and control, using base R's `split`;
save the two as separate data frames using `list2env`;
using `purrr::map_dfr`, sample 2 matching rows for each case,
once for age
and once for income;
finally, sample 2 rows again from each of these results;
`bind_rows` these together with case.
library(tidyverse)
dat = structure(list(id = c(1, 2, 3, 111, 222, 333, 444, 555, 666, 777, 888, 999, 1000),
age = c(10, 20, 44, 12, 11, 8, 12, 11, 22, 21, 18, 21, 18),
income = c(35, 72, 11, 37, 36, 33, 70, 34, 74, 70, 44, 76, 70),
group = c("case", "case", "case", "control", "control", "control",
"control", "control", "control", "control", "control",
"control", "control")),
row.names = c(NA, -13L), class = c("tbl_df", "tbl", "data.frame"))
dat
#> # A tibble: 13 x 4
#> id age income group
#> <dbl> <dbl> <dbl> <chr>
#> 1 1 10 35 case
#> 2 2 20 72 case
#> 3 3 44 11 case
#> 4 111 12 37 control
#> 5 222 11 36 control
#> 6 333 8 33 control
#> 7 444 12 70 control
#> 8 555 11 34 control
#> 9 666 22 74 control
#> 10 777 21 70 control
#> 11 888 18 44 control
#> 12 999 21 76 control
#> 13 1000 18 70 control
dat %>%
  split(.$group) %>%
  list2env(envir = .GlobalEnv)
#> <environment: R_GlobalEnv>

set.seed(123)
bind_rows(case,
          map_dfr(case$age, ~ control %>%
                    filter(between(age, .x - 2, .x + 2)) %>%
                    sample_n(min(n(), 2))) %>% sample_n(min(n(), 2)),
          map_dfr(case$income, ~ control %>%
                    filter(between(income, .x - 2, .x + 2)) %>%
                    sample_n(min(n(), 2))) %>% sample_n(min(n(), 2)))
#> # A tibble: 7 x 4
#> id age income group
#> <dbl> <dbl> <dbl> <chr>
#> 1 1 10 35 case
#> 2 2 20 72 case
#> 3 3 44 11 case
#> 4 222 11 36 control
#> 5 777 21 70 control
#> 6 111 12 37 control
#> 7 333 8 33 control
The code below will also do the same without saving individual data frames:
dat %>%
  split(.$group) %>%
  {bind_rows(.$case,
             map_dfr(.$case$age, \(.x) .$control %>%
                       filter(between(age, .x - 2, .x + 2)) %>%
                       sample_n(min(n(), 2))) %>% sample_n(min(n(), 2)),
             map_dfr(.$case$income, \(.x) .$control %>%
                       filter(between(income, .x - 2, .x + 2)) %>%
                       sample_n(min(n(), 2))) %>% sample_n(min(n(), 2)))}

Separate case and control into different data frames. For each row in case_data, find the matching rows in control_data and select 2 random rows from them.
Using map_df we can combine everything into one data frame.
library(dplyr)
library(purrr)

case_data <- dat %>% filter(group == 'case')
control_data <- dat %>% filter(group == 'control')

case_data %>%
  group_split(row_number(), .keep = FALSE) %>%
  map_df(~ bind_rows(.x, control_data %>%
                       filter(between(age, .x$age - 2, .x$age + 2),
                              between(income, .x$income - 2, .x$income + 2)) %>%
                       slice_sample(n = 2)))
# id age income group
# <dbl> <dbl> <dbl> <chr>
#1 1 10 35 case
#2 333 8 33 control
#3 111 12 37 control
#4 2 20 72 case
#5 666 22 74 control
#6 777 21 70 control

You could also use the following solution. I wrapped the map2 function inside curly braces so that I could choose which variables to use for .x and .y; otherwise %>% would have placed the whole data set as the first argument:
library(dplyr)
library(purrr)

dat %>%
  filter(group == "case") %>%
  group_by(id) %>%
  {map2(.$age, .$income, ~ dat %>%
          filter(group == "control" & age >= .x - 2 & age <= .x + 2 &
                   income >= .y - 2 & income <= .y + 2))} %>%
  map_dfr(~ .x %>%
            slice_sample(n = 2)) %>%
  bind_rows(dat %>%
              filter(group == "case")) %>%
  arrange(id)
# A tibble: 7 x 4
id age income group
<dbl> <dbl> <dbl> <chr>
1 1 10 35 case
2 2 20 72 case
3 3 44 11 case
4 222 11 36 control
5 333 8 33 control
6 777 21 70 control
7 1000 18 70 control

Related

replace NAs in data frame with 'average if' of row

I have some data where each unique ID is a member of a group. There are some IDs with missing data; for these I'd like to take the average of the other members of the same group for that row.
For example, with the data below I'd like to replace the NA for id 3 in row V_2 with the average of the other Group A members for that row (the average of 21 and 22). Similarly, for id 7 in row V_3 it would be the average of 34 and 64.
Group=rep(c('A', 'B', 'C'), each=3)
id=1:9
V_1 = t(c(10,20,30,40,10,10,20,35,65))
V_2 = t(c(21,22,"NA",42,12,12,22,32,63))
V_3 = t(c(24,24,34,44,14,14,"NA",34,64))
df <- as.data.frame(rbind(Group, id, V_1, V_2, V_3))
df
Group A A A B B B C C C
id 1 2 3 4 5 6 7 8 9
X 10 20 30 40 10 10 20 35 65
X.1 21 22 NA 42 12 12 22 32 63
X.2 24 24 34 44 14 14 NA 34 64
An approach using dplyr. The warnings occur because data frame columns are all character in your example (because the character class Group is in row 1). So ideally the whole data frame should be transposed...
library(dplyr)
library(tidyr)

tibble(data.frame(t(df))) %>%
  group_by(Group) %>%
  mutate(across(X:X.2, ~ as.numeric(.x))) %>%
  mutate(across(X:X.2, ~ replace_na(.x, mean(.x, na.rm = TRUE)))) %>%
  t() %>%
  as.data.frame()
V1 V2 V3 V4 V5 V6 V7 V8 V9
Group A A A B B B C C C
id 1 2 3 4 5 6 7 8 9
X 10 20 30 40 10 10 20 35 65
X.1 21.0 22.0 21.5 42.0 12.0 12.0 22.0 32.0 63.0
X.2 24 24 34 44 14 14 49 34 64
Warning messages:
1: Problem while computing `..1 = across(X:X.2, ~as.numeric(.x))`.
ℹ NAs introduced by coercion
ℹ The warning occurred in group 1: Group = "A".
2: Problem while computing `..1 = across(X:X.2, ~as.numeric(.x))`.
ℹ NAs introduced by coercion
ℹ The warning occurred in group 3: Group = "C".
Same example using transposed data
df_t %>%
  group_by(Group) %>%
  mutate(across(X:X.2, ~ replace_na(.x, mean(.x, na.rm = TRUE)))) %>%
  ungroup()
# A tibble: 9 × 5
Group id X X.1 X.2
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 1 10 21 24
2 A 2 20 22 24
3 A 3 30 21.5 34
4 B 4 40 42 44
5 B 5 10 12 14
6 B 6 10 12 14
7 C 7 20 22 49
8 C 8 35 32 34
9 C 9 65 63 64
With transposition back to the wide format:
df_t %>%
  group_by(Group) %>%
  mutate(across(X:X.2, ~ replace_na(.x, mean(.x, na.rm = TRUE)))) %>%
  t() %>%
  as.data.frame()
V1 V2 V3 V4 V5 V6 V7 V8 V9
Group A A A B B B C C C
id 1 2 3 4 5 6 7 8 9
X 10 20 30 40 10 10 20 35 65
X.1 21.0 22.0 21.5 42.0 12.0 12.0 22.0 32.0 63.0
X.2 24 24 34 44 14 14 49 34 64
The transposed data:
df_t <- structure(list(Group = c("A", "A", "A", "B", "B", "B", "C", "C",
"C"), id = c(1, 2, 3, 4, 5, 6, 7, 8, 9), X = c(10, 20, 30, 40,
10, 10, 20, 35, 65), X.1 = c(21, 22, NA, 42, 12, 12, 22, 32,
63), X.2 = c(24, 24, 34, 44, 14, 14, NA, 34, 64)), class = "data.frame", row.names = c("V1",
"V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9"))
Structuring the data the tidy way might make it easier. Package {Hmisc} offers a convenience impute helper (since this is such a frequent task). That way you could proceed as follows:
tidy the data
## example dataframe df:
set.seed(4711)
df <- data.frame(Group = gl(3, 3, labels = LETTERS[1:3]),
id = 1:9,
V_1 = sample(c(NA, 1:8)),
V_2 = sample(c(NA, 1:8)),
V_3 = sample(c(NA, 1:8))
)
## > df |> head()
## Group id V_1 V_2 V_3
## 1 A 1 1 7 6
## 2 A 2 4 8 2
## 3 A 3 3 2 3
## 4 B 4 6 4 1
## 5 B 5 5 3 8
## 6 B 6 NA NA 4
use {Hmisc} and {dplyr} together with the pipeline notation:
library(dplyr)
library(Hmisc)

df_imputed <- df |> mutate(across(V_1:V_3, impute, mean))
> df_imputed |> head()
Group id V_1 V_2 V_3
1 A 1 1.0 7.0 6
2 A 2 4.0 8.0 2
3 A 3 3.0 2.0 3
4 B 4 6.0 4.0 1
5 B 5 5.0 3.0 8
6 B 6 4.5 4.5 4
Should you now prefer to replace missing values with groupwise medians instead of total means, the tidy arrangement (together with {dplyr}) requires only one additional group_by clause:
df |>
  group_by(Group) |>
  mutate(across(V_1:V_3, impute, median))

create a new variable based on other factors using R

So I have this dataframe and I aim to add a new variable based on others:
Qi  Age  c_gen
1   56   13
2   43   15
5   31   6
3   67   8
I want to create a variable called c_sep such that:
if Qi == 1 or Qi == 2, c_sep takes a random number between (c_gen + 6) and Age;
if Qi == 3 or Qi == 4, c_sep takes a random number between (Age - 15) and Age;
and 0 otherwise,
so my data would look something like this:
Qi  Age  c_gen  c_sep
1   56   13     24
2   43   15     13
5   31   6      0
3   67   8      40
Any ideas, please?
In base R, you can do something along the lines of:
dat <- read.table(text = "Qi Age c_gen
1 56 13
2 43 15
5 31 6
3 67 8", header = TRUE)

set.seed(100)
dat$c_sep <- 0
dat$c_sep[dat$Qi %in% c(1, 2)] <- apply(dat[dat$Qi %in% c(1, 2), ], 1,
                                        \(row) sample((row["c_gen"] + 6):row["Age"], 1))
dat$c_sep[dat$Qi %in% c(3, 4)] <- apply(dat[dat$Qi %in% c(3, 4), ], 1,
                                        \(row) sample((row["Age"] - 15):row["Age"], 1))
dat
# Qi Age c_gen c_sep
# 1 1 56 13 28
# 2 2 43 15 43
# 3 5 31 6 0
# 4 3 67 8 57
If you are doing it more than twice you might want to put this in a function - depending on your requirements.
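For instance, a small wrapper might look like this (just a sketch; sample_in_range is a made-up helper name, and it assumes the dat created above):
# hypothetical helper: draw one random integer per element from the range lo[i]:hi[i]
sample_in_range <- function(lo, hi) {
  mapply(function(l, h) {
    rng <- l:h
    rng[sample.int(length(rng), 1)]  # avoids sample()'s length-1 surprise
  }, lo, hi)
}

set.seed(100)
dat$c_sep <- 0
sel12 <- dat$Qi %in% c(1, 2)
sel34 <- dat$Qi %in% c(3, 4)
dat$c_sep[sel12] <- sample_in_range(dat$c_gen[sel12] + 6, dat$Age[sel12])
dat$c_sep[sel34] <- sample_in_range(dat$Age[sel34] - 15, dat$Age[sel34])
dat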
Try this
df$c_sep <- ifelse(df$Qi == 1 | df$Qi == 2,
                   sapply(1:nrow(df), \(x) sample(seq(df$c_gen[x] + 6, df$Age[x]), 1)),
                   ifelse(df$Qi == 3 | df$Qi == 4,
                          sapply(1:nrow(df), \(x) sample(seq(df$Age[x] - 15, df$Age[x]), 1)),
                          0))
output
Qi Age c_gen c_sep
1 1 56 13 41
2 2 43 15 42
3 5 31 6 0
4 3 67 8 58
A tidyverse option:
library(tidyverse)

df <- tribble(
  ~Qi, ~Age, ~c_gen,
  1, 56, 13,
  2, 43, 15,
  5, 31, 6,
  3, 67, 8
)

df |>
  rowwise() |>
  mutate(c_sep = case_when(
    Qi <= 2 ~ sample(seq(c_gen + 6, Age, 1), 1),
    between(Qi, 3, 4) ~ sample(seq(Age - 15, Age, 1), 1),
    TRUE ~ 0
  )) |>
  ungroup()
#> # A tibble: 4 × 4
#> Qi Age c_gen c_sep
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 56 13 39
#> 2 2 43 15 41
#> 3 5 31 6 0
#> 4 3 67 8 54
Created on 2022-06-29 by the reprex package (v2.0.1)

Choosing the right column based on a vector of column names

I'm trying to pull values from columns based on the values in a vector. I'm not sure I have the right words to describe the problem, but the code should help.
This feels related to coalesce, but maybe not?
library(tidyverse)

# Starting table
dat <- tibble(
  A = 1:10,
  B = 31:40,
  C = 101:110,
  value = c("A", "C", "B", "A", "B", "C", "C", "B", "A", "A")
)
I want:
dat %>%
  mutate(
    output = c(1, 102, 33, 4, 35, 106, 107, 38, 9, 10)
  )
I could do
dat %>%
  mutate(
    output = case_when(value == "A" ~ A,
                       value == "B" ~ B,
                       value == "C" ~ C)
  )
but my real application has many values, and I want to take advantage of value already holding the matching information.
Is there a function that does:
dat %>%
mutate(output = grab_the_right_column(value))
Thanks!
The rowwise approach would be less efficient, but it is a compact tidyverse way to get the column value based on the column name for each row.
library(dplyr)

dat %>%
  rowwise() %>%
  mutate(output = get(value)) %>%
  ungroup()
Output:
# A tibble: 10 x 5
# A B C value output
# <int> <int> <int> <chr> <int>
# 1 1 31 101 A 1
# 2 2 32 102 C 102
# 3 3 33 103 B 33
# 4 4 34 104 A 4
# 5 5 35 105 B 35
# 6 6 36 106 C 106
# 7 7 37 107 C 107
# 8 8 38 108 B 38
# 9 9 39 109 A 9
#10 10 40 110 A 10
These types of issues are handled more efficiently with a row/column indexing approach in base R. Create a two-column matrix of the row sequence and the column index (obtained by matching the 'value' column against the column names), and use it to extract the elements:
# matrix indexing: one (row, column-index) pair per row of dat
dat$output <- as.data.frame(dat)[, 1:3][cbind(seq_len(nrow(dat)),
                                              match(dat$value, names(dat)[1:3]))]
You can also use purrr and pmap():
library(dplyr)
library(purrr)

dat %>%
  mutate(output = pmap(., ~ {
    v1 <- c(...)
    v1[names(v1) == v1[['value']]]
  }) %>%
    as.numeric() %>%
    unlist())
# A tibble: 10 x 5
A B C value output
<int> <int> <int> <chr> <dbl>
1 1 31 101 A 1
2 2 32 102 C 102
3 3 33 103 B 33
4 4 34 104 A 4
5 5 35 105 B 35
6 6 36 106 C 106
7 7 37 107 C 107
8 8 38 108 B 38
9 9 39 109 A 9
10 10 40 110 A 10

Selected columns to new row

I'm trying to split columns into new rows keeping the data of the first two columns.
d1 <- data.frame(a=c(100,0,78),b=c(0,137,117),c.1=c(111,17,91), d.1=c(99,66,22), c.2=c(11,33,44), d.2=c(000,001,002))
d1
a b c.1 d.1 c.2 d.2
1 100 0 111 99 11 0
2 0 137 17 66 33 1
3 78 117 91 22 44 2
Expected results would be:
a b c d
1 100 0 111 99
2 100 0 11 0
3 0 137 17 66
4 0 137 33 1
5 78 117 91 22
6 78 117 44 2
I have made multiple tries with dplyr, but it seems it is not the right approach.
If you want to stay in dplyr/tidyverse, you want tidyr::pivot_longer with a special reference to .value -- see the pivot vignette for more:
library(tidyverse)

d1 <- data.frame(
  a = c(100, 0, 78),
  b = c(0, 137, 117),
  c.1 = c(111, 17, 91),
  d.1 = c(99, 66, 22),
  c.2 = c(11, 33, 44),
  d.2 = c(000, 001, 002)
)

d1 %>%
  pivot_longer(
    cols = contains("."),
    names_to = c(".value", "group"),
    names_sep = "\\."
  )
#> # A tibble: 6 x 5
#> a b group c d
#> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 100 0 1 111 99
#> 2 100 0 2 11 0
#> 3 0 137 1 17 66
#> 4 0 137 2 33 1
#> 5 78 117 1 91 22
#> 6 78 117 2 44 2
Created on 2020-05-11 by the reprex package (v0.3.0)
This could solve your issue:
#Try this
a1 <- d1[,c(1:4)]
a2 <- d1[,c(1,2,5,6)]
names(a1) <- names(a2) <- c('a','b','c','d')
DF <- rbind(a1,a2)
The posted answers are good, here's my attempt:
library(tidyverse)

df <- data.frame(a = c(100, 0, 78), b = c(0, 137, 117),
                 c.1 = c(111, 17, 91), d.1 = c(99, 66, 22),
                 c.2 = c(11, 33, 44), d.2 = c(000, 001, 002))

# Make 2 pivot-long operations
df_c <- df %>% select(-d.1, -d.2) %>%
  pivot_longer(cols = c("c.1", "c.2"), values_to = "c") %>% select(-name)
df_d <- df %>% select(-c.1, -c.2) %>%
  pivot_longer(cols = c("d.1", "d.2"), values_to = "d") %>% select(-name)

# bind them without the "key" columns
bind_cols(df_c, select(df_d, -a, -b))
Which produces
# A tibble: 6 x 4
a b c d
<dbl> <dbl> <dbl> <dbl>
1 100 0 111 99
2 100 0 11 0
3 0 137 17 66
4 0 137 33 1
5 78 117 91 22
6 78 117 44 2

R: create a column to identify the group that a row belongs to

Description of the data: the dataset contains information about users, namely their age, gender and the membership they hold.
Goal: Create a new column to identify the group/label for each user based on pre-defined conditions.
Age conditions: multiple age brackets:
18 <= age <= 24, 25 <= age <= 30, 31 <= age <= 41, 42 <= age <= 60, age >= 61
Gender: M/F
Membership: A,B,C,I
I created a sample data frame to try out the creation of the new column identifying the group/label:
df = data.frame(userid = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10,11, 12),
age = c(18, 61, 23, 35, 30, 25, 55, 53, 45, 41, 21, NA),
gender = c('F', 'M', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'F', '<NA>', 'M'),
membership = c('A', 'B', 'A', 'C', 'C', 'B', 'A', 'A', 'I', 'I', 'A', '<NA>'))
userid age gender membership
1 1 18 F A
2 2 61 M B
3 3 23 F A
4 4 35 F C
5 5 30 M C
6 6 25 M B
7 7 55 M A
8 8 53 M A
9 9 45 M I
10 10 41 F I
11 11 21 <NA> A
12 12 NA M <NA>
Based on the above data there exist 4 * 2 * 5 = 40 options (combinations).
Final outcome:
userid age gender membership GroupID
1 1 18 F A 1
2 2 61 M B 40
3 3 23 F A 1
4 4 35 F C 4
5 5 30 M C 5
6 6 25 M B 3
7 7 55 M A 32
8 8 53 M A 32
9 9 45 M I 34
10 10 41 F I 35
11 11 21 <NA> A 43 (assuming it will auto-detect the combination)
12 12 NA M <NA> 46
I believe my calculation of the combinations is correct; if so, how can I use dplyr or any other option to get the above data frame?
Should I use multiple if conditions to cover all the options?
In dplyr, is there a way to provide conditions for each column when setting the grouping, along the lines of:
df %>% group_by(age, gender, membership)
Two options.
One, more automated:
# install.packages("tidyverse", dependencies = TRUE)
library(tidyverse)

df %>%
  mutate(ageCat = cut(age, breaks = c(-Inf, 24, 30, 41, 60, Inf))) %>%
  mutate(GroupID = group_indices(., ageCat, gender, membership)) %>%
  select(-ageCat)
#> userid age gender membership GroupID
#> 1 1 18 F A 2
#> 2 2 61 M B 9
#> 3 3 23 F A 2
#> 4 4 35 F C 5
#> 5 5 30 M C 4
#> 6 6 25 M B 3
#> 7 7 55 M A 7
#> 8 8 53 M A 7
#> 9 9 45 M I 8
#> 10 10 41 F I 6
#> 11 11 21 <NA> A 1
#> 12 12 NA M <NA> 10
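Note that in more recent dplyr versions group_indices() is superseded; the same kind of per-group id can be obtained with cur_group_id() after group_by(). A sketch assuming the df above (output not shown):
library(dplyr)

df %>%
  mutate(ageCat = cut(age, breaks = c(-Inf, 24, 30, 41, 60, Inf))) %>%
  group_by(ageCat, gender, membership) %>%
  mutate(GroupID = cur_group_id()) %>%
  ungroup() %>%
  select(-ageCat)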
Two, more manual:
Here I illustrate a solution for categories 1 and 4; you have to code the rest yourself.
df %>%
  mutate(GroupID =
           ifelse(age >= 18 & age <= 24 & gender == 'F' & membership == "A", 1,
                  ifelse(age >= 31 & age <= 41 & gender == 'F' & membership == "C", 4, NA)))
#> userid age gender membership GroupID
#> 1 1 18 F A 1
#> 2 2 61 M B NA
#> 3 3 23 F A 1
#> 4 4 35 F C 4
#> 5 5 30 M C NA
#> 6 6 25 M B NA
#> 7 7 55 M A NA
#> 8 8 53 M A NA
#> 9 9 45 M I NA
#> 10 10 41 F I NA
#> 11 11 21 <NA> A NA
#> 12 12 NA M <NA> NA
The data structure is given above, in case others feel like giving it a go.
You can try this:
library(data.table)

setDT(df)[, agegrp := ifelse(age >= 18 & age <= 24, 1,
                      ifelse(age >= 25 & age <= 30, 2,
                      ifelse(age >= 31 & age <= 41, 3,
                      ifelse(age >= 42 & age <= 60, 4, 5))))]
setDT(df)[, group := .GRP, by = .(agegrp, gender, membership)]
If you want to use base R only, you could do something like this:
# 1
allcombos <- expand.grid(c("M", "F"), c("A", "B", "C", "I"), 1:5)
allgroups <- do.call(paste0, allcombos)  # 40 unique combinations
# 2
agegroups <- cut(df$age,
                 breaks = c(17, 24, 30, 41, 61, 99),
                 labels = c(1, 2, 3, 4, 5))
# 3
df$groupid <- paste0(df$gender, df$membership, agegroups)
df$groupid <- factor(df$groupid, levels = allgroups, labels = 1:length(allgroups))
expand.grid gives you a data.frame with three columns where every row represents a unique combination of the three arguments provided. As you said, these are 40 combinations. The second line combines every row of that data frame into a single string, like "MA1", "FA1", "MB1", etc.
Then we use cut to assign each age to its relevant age group, labelled 1 to 5.
Finally, we create a column in df that contains the three-character combination of gender, membership and age group, which is then converted to a factor according to all possible combinations found in allgroups.
