I have some data where each unique ID is a member of a group. Some IDs have missing data; for these I'd like to take the average of the other members of the same group for that row.
For example, with the below data I'd like to replace the "NA" for id 3 in row V_2 with the average of the other Group A members for that row (average of 21 & 22). Similarly for id 7 in row V_3 it would be the average of 34 & 64.
Group=rep(c('A', 'B', 'C'), each=3)
id=1:9
V_1 = t(c(10,20,30,40,10,10,20,35,65))
V_2 = t(c(21,22,"NA",42,12,12,22,32,63))
V_3 = t(c(24,24,34,44,14,14,"NA",34,64))
df <- as.data.frame(rbind(Group, id, V_1, V_2, V_3))
df
Group A A A B B B C C C
id 1 2 3 4 5 6 7 8 9
X 10 20 30 40 10 10 20 35 65
X.1 21 22 NA 42 12 12 22 32 63
X.2 24 24 34 44 14 14 NA 34 64
An approach using dplyr. The warnings occur because data frame columns are all character in your example (because the character class Group is in row 1). So ideally the whole data frame should be transposed...
library(dplyr)
library(tidyr)
tibble(data.frame(t(df))) %>%
group_by(Group) %>%
mutate(across(X:X.2, ~ as.numeric(.x))) %>%
mutate(across(X:X.2, ~ replace_na(.x, mean(.x, na.rm=T)))) %>%
t() %>%
as.data.frame()
V1 V2 V3 V4 V5 V6 V7 V8 V9
Group A A A B B B C C C
id 1 2 3 4 5 6 7 8 9
X 10 20 30 40 10 10 20 35 65
X.1 21.0 22.0 21.5 42.0 12.0 12.0 22.0 32.0 63.0
X.2 24 24 34 44 14 14 49 34 64
Warning messages:
1: Problem while computing `..1 = across(X:X.2, ~as.numeric(.x))`.
ℹ NAs introduced by coercion
ℹ The warning occurred in group 1: Group = "A".
2: Problem while computing `..1 = across(X:X.2, ~as.numeric(.x))`.
ℹ NAs introduced by coercion
ℹ The warning occurred in group 3: Group = "C".
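To see the coercion in isolation, here is a minimal base R sketch. It assumes the missing value is stored as a real NA rather than the string "NA" used in the question, and the name df_t_sketch is made up for illustration.

```r
# A character row ("Group") forces the whole matrix built by rbind() to character
Group <- rep(c("A", "B", "C"), each = 3)
V_2 <- c(21, 22, NA, 42, 12, 12, 22, 32, 63)
m <- rbind(Group, V_2)
is.character(m)

# After transposing, convert the value column back to numeric.
# A real NA stays NA here, so no coercion warning is raised.
df_t_sketch <- data.frame(t(m), stringsAsFactors = FALSE)
df_t_sketch$V_2 <- as.numeric(df_t_sketch$V_2)
str(df_t_sketch)
```

With the string "NA" instead of a real NA, as.numeric() would still return NA for that cell, but it would emit the "NAs introduced by coercion" warning shown above.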
Same example using transposed data
df_t %>%
group_by(Group) %>%
mutate(across(X:X.2, ~ replace_na(.x, mean(.x, na.rm=T)))) %>%
ungroup()
# A tibble: 9 × 5
Group id X X.1 X.2
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 1 10 21 24
2 A 2 20 22 24
3 A 3 30 21.5 34
4 B 4 40 42 44
5 B 5 10 12 14
6 B 6 10 12 14
7 C 7 20 22 49
8 C 8 35 32 34
9 C 9 65 63 64
with transpose back to wider format
df_t %>%
group_by(Group) %>%
mutate(across(X:X.2, ~ replace_na(.x, mean(.x, na.rm=T)))) %>%
t() %>%
as.data.frame()
V1 V2 V3 V4 V5 V6 V7 V8 V9
Group A A A B B B C C C
id 1 2 3 4 5 6 7 8 9
X 10 20 30 40 10 10 20 35 65
X.1 21.0 22.0 21.5 42.0 12.0 12.0 22.0 32.0 63.0
X.2 24 24 34 44 14 14 49 34 64
transposed data
df_t <- structure(list(Group = c("A", "A", "A", "B", "B", "B", "C", "C",
"C"), id = c(1, 2, 3, 4, 5, 6, 7, 8, 9), X = c(10, 20, 30, 40,
10, 10, 20, 35, 65), X.1 = c(21, 22, NA, 42, 12, 12, 22, 32,
63), X.2 = c(24, 24, 34, 44, 14, 14, NA, 34, 64)), class = "data.frame", row.names = c("V1",
"V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9"))
Structuring the data the tidy way might make it easier. Package {Hmisc} offers a convenience impute helper (since this is such a frequent task). That way you could proceed as follows:
tidy the data
## example dataframe df:
set.seed(4711)
df <- data.frame(Group = gl(3, 3, labels = LETTERS[1:3]),
id = 1:9,
V_1 = sample(c(NA, 1:8)),
V_2 = sample(c(NA, 1:8)),
V_3 = sample(c(NA, 1:8))
)
## > df |> head()
## Group id V_1 V_2 V_3
## 1 A 1 1 7 6
## 2 A 2 4 8 2
## 3 A 3 3 2 3
## 4 B 4 6 4 1
## 5 B 5 5 3 8
## 6 B 6 NA NA 4
use {Hmisc} and {dplyr} together with the pipeline notation:
library(dplyr)
library(Hmisc)
df_imputed <-
df |> mutate(across(V_1:V_3, ~ impute(.x, mean)))
> df_imputed |> head()
Group id V_1 V_2 V_3
1 A 1 1.0 7.0 6
2 A 2 4.0 8.0 2
3 A 3 3.0 2.0 3
4 B 4 6.0 4.0 1
5 B 5 5.0 3.0 8
6 B 6 4.5 4.5 4
Should you now prefer to replace missing values with groupwise medians instead of total means, the tidy arrangement (together with {dplyr}) requires only one additional group_by clause:
df |>
group_by(Group) |>
mutate(across(V_1:V_3, ~ impute(.x, median)))
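For readers without Hmisc, the difference between whole-column and group-wise imputation can be sketched in base R. The helper impute_with and the toy vectors x and g are hypothetical and not part of the answer above.

```r
# Replace missing values in x by fun() of the non-missing values
impute_with <- function(x, fun) {
  x[is.na(x)] <- fun(x, na.rm = TRUE)
  x
}

x <- c(1, NA, 3, 10, NA, 20, 60)
g <- c("A", "A", "A", "B", "B", "B", "B")

# Whole-column mean: both gaps receive mean(c(1, 3, 10, 20, 60))
overall <- impute_with(x, mean)

# Group-wise median: ave() applies the helper within each level of g
by_group <- ave(x, g, FUN = function(v) impute_with(v, median))

overall
by_group
```

The only difference between the two calls is whether the helper sees the whole vector or one group at a time, which is what the group_by clause does in the dplyr version.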
I have two datasets
cycle <- c(160, 160, 150, 158, 180)
split1 <- c(2, 2,4, 6, 8)
split2 <- c(10,10, 12, 14, 16)
df1 <- data.frame(cycle, split1, split2)
df1
cycle split1 split2
1 160 2 10
2 160 2 10
3 150 4 12
4 158 6 14
5 180 8 16
cycle <- c(160,150,190,180,161,150,140,179)
split1 <- c(2,4,12,8,2,4,32,8)
split2 <- c(10, 12, 18, 16, 10, 12, 21, 16)
df2 <- data.frame(cycle, split1, split2)
df2
cycle split1 split2
1 160 2 10
2 150 4 12
3 190 12 18
4 180 8 16
5 161 2 10
6 150 4 12
7 140 32 21
8 179 8 16
I want to match the values of df1 and df2 and label the df2 rows based on two conditions:
1- If the values of all three columns, i.e. cycle, split1, and split2, are exactly the same, label the row "Same", otherwise "Different".
2- If only the cycle value differs between df1 and df2, by +1 or -1, and the rest of the row values are the same, label the row "Same", otherwise "Different".
The output should look like this
cycle split1 split2 Type
1 160 2 10 Same
2 150 4 12 Same
3 190 12 18 Different
4 180 8 16 Same
5 161 2 10 Same
6 150 4 12 Same
7 140 32 21 Different
8 179 8 16 Same
I was successful in achieving the first condition as below
df1 <- df1 %>% mutate(key = paste(cycle, split1, split2, sep = "_"))
df2 <- df2 %>% mutate(key = paste(cycle, split1, split2, sep = "_"))
df2 %>% mutate(Type = ifelse(key %in% df1$key, 'same', 'different')) %>%
select(-key)
cycle split1 split2 Type
1 160 2 10 same
2 150 4 12 same
3 190 12 18 different
4 180 8 16 same
5 161 2 10 different
6 150 4 12 same
7 140 32 21 different
8 179 8 16 different
but having a problem achieving the second one.
Any idea how to do this efficiently?
Thank you in advance.
Based on your original df1 and df2 (without the generation of the new column key), you could use
df2 %>%
mutate(rn = row_number()) %>%
left_join(df1, by = c("split1", "split2"), suffix = c("", ".y")) %>%
mutate(
type = coalesce(
ifelse(abs(cycle - cycle.y) <= 1, "same", "different"),
"different")
) %>%
group_by(rn) %>%
distinct() %>%
ungroup() %>%
select(-rn, -cycle.y)
This returns
# A tibble: 8 x 4
cycle split1 split2 type
<dbl> <dbl> <dbl> <chr>
1 160 2 10 same
2 150 4 12 same
3 190 12 18 different
4 180 8 16 same
5 161 2 10 same
6 150 4 12 same
7 140 32 21 different
8 179 8 16 same
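The same labelling can also be sketched in base R without a join. This is only an illustration, under the assumption that a df2 row counts as "Same" as soon as any df1 row agrees on split1 and split2 and has a cycle within 1. The name is_same is made up.

```r
df1 <- data.frame(cycle  = c(160, 160, 150, 158, 180),
                  split1 = c(2, 2, 4, 6, 8),
                  split2 = c(10, 10, 12, 14, 16))
df2 <- data.frame(cycle  = c(160, 150, 190, 180, 161, 150, 140, 179),
                  split1 = c(2, 4, 12, 8, 2, 4, 32, 8),
                  split2 = c(10, 12, 18, 16, 10, 12, 21, 16))

# For every df2 row, check whether at least one df1 row matches on both
# split columns and has a cycle value that differs by at most 1
is_same <- vapply(seq_len(nrow(df2)), function(i) {
  any(df1$split1 == df2$split1[i] &
      df1$split2 == df2$split2[i] &
      abs(df1$cycle - df2$cycle[i]) <= 1)
}, logical(1))

df2$Type <- ifelse(is_same, "Same", "Different")
df2
```

Because any() collapses all candidate matches into one logical value per df2 row, duplicated rows in df1 cannot multiply the rows of the result.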
I want to match 2 controls to every case, subject to two conditions:
the age difference should be within ±2;
the income difference should be within ±2.
If there are more than 2 eligible controls for a case, I just need to select 2 of them at random.
There is an example:
EXAMPLE
DATA
dat = structure(list(id = c(1, 2, 3, 4, 111, 222, 333, 444, 555, 666,
777, 888, 999, 1000),
age = c(10, 20, 44, 11, 12, 11, 8, 12, 11, 22, 21, 18, 21, 18),
income = c(35, 72, 11, 35, 37, 36, 33, 70, 34, 74, 70, 44, 76, 70),
group = c("case", "case", "case", "case", "control", "control",
"control", "control", "control", "control", "control",
"control", "control", "control")),
row.names = c(NA, -14L), class = c("tbl_df", "tbl", "data.frame"))
> dat
# A tibble: 14 x 4
id age income group
<dbl> <dbl> <dbl> <chr>
1 1 10 35 case
2 2 20 72 case
3 3 44 11 case
4 4 11 35 case
5 111 12 37 control
6 222 11 36 control
7 333 8 33 control
8 444 12 70 control
9 555 11 34 control
10 666 22 74 control
11 777 21 70 control
12 888 18 44 control
13 999 21 76 control
14 1000 18 70 control
EXPECT OUTCOME
For id = 1, the matched controls are listed below; I just need to select 2 of them at random.
id   age  income  group
111  12   37      control
222  11   36      control
333  8    33      control
555  11   34      control
For id = 2, the matched controls are listed below; I just need to select 2 of them at random.
id    age  income  group
666   22   74      control
777   21   70      control
1000  18   70      control
For id = 3, there are no matched controls in dat.
For id = 4, the matched controls are listed below; I just need to select 2 of them at random.
Note that the candidate controls for id = 1 and id = 4 overlap. I don't want two cases to share a control: if id = 1 takes id = 111 and id = 222 as controls, then id = 4 can only take id = 555; if id = 1 takes id = 111 and id = 333, then id = 4 can only take id = 222 and id = 555.
id   age  income  group
111  12   37      control
222  11   36      control
555  11   34      control
The final output might look like this (the ids in the control group are randomly selected from those that meet the conditions):
id    age  income  group
1     10   35      case
2     20   72      case
3     44   11      case
4     11   35      case
111   12   37      control
222   11   36      control
333   8    33      control
555   11   34      control
777   21   70      control
1000  18   70      control
NOTE
I've looked up some websites, but they don't meet my needs. I don't know how to implement my requirements using R code.
Any help will be highly appreciated!
Reference:
1. https://stackoverflow.com/questions/56026700/is-there-any-package-for-case-control-matching-individual-1n-matching-in-r-n
2. Case control matching in R (or spss), based on age, sex and ethnicity?
3. Matching case-controls in R using the ccoptimalmatch package
4. Exact Matching in R
As per modified requirement, I propose the following for loop
library(dplyr, warn.conflicts = F)
dat %>%
split(.$group) %>%
list2env(envir = .GlobalEnv)
#> <environment: R_GlobalEnv>
control$FILTER <- FALSE
control
#> # A tibble: 10 x 5
#> id age income group FILTER
#> <dbl> <dbl> <dbl> <chr> <lgl>
#> 1 111 12 37 control FALSE
#> 2 222 11 36 control FALSE
#> 3 333 8 33 control FALSE
#> 4 444 12 70 control FALSE
#> 5 555 11 34 control FALSE
#> 6 666 22 74 control FALSE
#> 7 777 21 70 control FALSE
#> 8 888 18 44 control FALSE
#> 9 999 21 76 control FALSE
#> 10 1000 18 70 control FALSE
set.seed(123)
for(i in seq_len(nrow(case))){
x <- which(between(control$age, case$age[i] -2, case$age[i] +2) &
between(control$income, case$income[i] -2, case$income[i] + 2) &
!control$FILTER)
# index into x explicitly: sample(x, ...) would draw from 1:x when x has length one
control$FILTER[x[sample.int(length(x), min(2, length(x)))]] <- TRUE
}
control
#> # A tibble: 10 x 5
#> id age income group FILTER
#> <dbl> <dbl> <dbl> <chr> <lgl>
#> 1 111 12 37 control TRUE
#> 2 222 11 36 control TRUE
#> 3 333 8 33 control TRUE
#> 4 444 12 70 control FALSE
#> 5 555 11 34 control TRUE
#> 6 666 22 74 control FALSE
#> 7 777 21 70 control TRUE
#> 8 888 18 44 control FALSE
#> 9 999 21 76 control FALSE
#> 10 1000 18 70 control TRUE
bind_rows(case, control) %>% filter(FILTER | is.na(FILTER)) %>% select(-FILTER)
#> # A tibble: 10 x 4
#> id age income group
#> <dbl> <dbl> <dbl> <chr>
#> 1 1 10 35 case
#> 2 2 20 72 case
#> 3 3 44 11 case
#> 4 4 11 35 case
#> 5 111 12 37 control
#> 6 222 11 36 control
#> 7 333 8 33 control
#> 8 555 11 34 control
#> 9 777 21 70 control
#> 10 1000 18 70 control
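Check results for a different seed (control$FILTER still carries the selections from the first run; reset it to FALSE beforehand if a fresh draw is wanted)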
set.seed(234)
for(i in seq_len(nrow(case))){
x <- which(between(control$age, case$age[i] -2, case$age[i] +2) &
between(control$income, case$income[i] -2, case$income[i] + 2) &
!control$FILTER)
# index into x explicitly: sample(x, ...) would draw from 1:x when x has length one
control$FILTER[x[sample.int(length(x), min(2, length(x)))]] <- TRUE
}
control
bind_rows(case, control) %>% filter(FILTER | is.na(FILTER)) %>% select(-FILTER)
# A tibble: 10 x 4
id age income group
<dbl> <dbl> <dbl> <chr>
1 1 10 35 case
2 2 20 72 case
3 3 44 11 case
4 4 11 35 case
5 111 12 37 control
6 222 11 36 control
7 333 8 33 control
8 555 11 34 control
9 777 21 70 control
10 1000 18 70 control
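One detail worth keeping in mind with loops like this: base R's sample(x, size) treats a single positive number x as the range 1:x, so a lone remaining candidate index could be swapped for an arbitrary row. Below is a minimal sketch of a guard, with pick_up_to as a hypothetical helper name.

```r
# Pick up to n elements of x; always indexes into x, so a length-one x
# is returned as-is instead of being expanded to 1:x
pick_up_to <- function(x, n) {
  x[sample.int(length(x), min(n, length(x)))]
}

set.seed(1)
one_left  <- pick_up_to(5L, 2)             # only candidate is row 5
none_left <- pick_up_to(integer(0), 2)     # no candidates at all
several   <- pick_up_to(c(3L, 7L, 9L), 2)  # two of the three candidates

one_left
none_left
several
```

For vectors with two or more elements this draws exactly as sample(x, size) would, so it only changes the single-candidate case.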
dat modified before proceeding for id 3
split the data into two groups, case and control, using base R's split
save the two as separate data frames using list2env
using purrr::map_dfr you can sample 2 rows for each case
once for age
and once for income
finally sample 2 rows again from each of these results
bind_rows these again with case
library(tidyverse)
dat = structure(list(id = c(1, 2, 3, 111, 222, 333, 444, 555, 666, 777, 888, 999, 1000),
age = c(10, 20, 44, 12, 11, 8, 12, 11, 22, 21, 18, 21, 18),
income = c(35, 72, 11, 37, 36, 33, 70, 34, 74, 70, 44, 76, 70),
group = c("case", "case", "case", "control", "control", "control",
"control", "control", "control", "control", "control",
"control", "control")),
row.names = c(NA, -13L), class = c("tbl_df", "tbl", "data.frame"))
dat
#> # A tibble: 13 x 4
#> id age income group
#> <dbl> <dbl> <dbl> <chr>
#> 1 1 10 35 case
#> 2 2 20 72 case
#> 3 3 44 11 case
#> 4 111 12 37 control
#> 5 222 11 36 control
#> 6 333 8 33 control
#> 7 444 12 70 control
#> 8 555 11 34 control
#> 9 666 22 74 control
#> 10 777 21 70 control
#> 11 888 18 44 control
#> 12 999 21 76 control
#> 13 1000 18 70 control
dat %>%
split(.$group) %>%
list2env(envir = .GlobalEnv)
#> <environment: R_GlobalEnv>
set.seed(123)
bind_rows(case, map_dfr(case$age, ~ control %>% filter(between(age, .x -2, .x +2) ) %>%
sample_n(min(n(),2))) %>% sample_n(min(n(),2)),
map_dfr(case$income, ~ control %>% filter(between(income, .x -2, .x +2)) %>%
sample_n(min(n(),2))) %>% sample_n(min(n(),2)))
#> # A tibble: 7 x 4
#> id age income group
#> <dbl> <dbl> <dbl> <chr>
#> 1 1 10 35 case
#> 2 2 20 72 case
#> 3 3 44 11 case
#> 4 222 11 36 control
#> 5 777 21 70 control
#> 6 111 12 37 control
#> 7 333 8 33 control
the below code will also do the same without saving individual dfs
dat %>%
split(.$group) %>%
{bind_rows(.$case,
map_dfr(.$case$age, \(.x) .$control %>% filter(between(age, .x -2, .x +2) ) %>%
sample_n(min(n(),2))) %>% sample_n(min(n(),2)),
map_dfr(.$case$income, \(.x) .$control %>% filter(between(income, .x -2, .x +2)) %>%
sample_n(min(n(),2))) %>% sample_n(min(n(),2)))}
Separate case and control in different dataframes. For each row in case_data find the matching rows in control_data and select 2 random rows from it.
Using map_df we can combine everything in one dataframe.
library(dplyr)
library(purrr)
case_data <- dat %>% filter(group == 'case')
control_data <- dat %>% filter(group == 'control')
case_data %>%
group_split(row_number(), .keep = FALSE) %>%
map_df(~bind_rows(.x, control_data %>%
filter(between(age, .x$age - 2, .x$age + 2),
between(income, .x$income - 2, .x$income + 2)) %>%
slice_sample(n = 2)))
# id age income group
# <dbl> <dbl> <dbl> <chr>
#1 1 10 35 case
#2 333 8 33 control
#3 111 12 37 control
#4 2 20 72 case
#5 666 22 74 control
#6 777 21 70 control
You could also use the following solution. I wrapped the map2 call inside curly braces so that I could choose which variables to use for .x and .y; otherwise %>% would have passed the whole data set as the first argument:
library(dplyr)
library(purrr)
dat %>%
filter(group == "case") %>%
group_by(id) %>%
{map2(.$age, .$income, ~ dat %>%
filter(group == "control" & age >= .x - 2 & age <= .x + 2 &
income >= .y - 2 & income <= .y + 2))} %>%
map_dfr(~ .x %>%
slice_sample(n = 2)) %>%
bind_rows(dat %>%
filter(group == "case")) %>%
arrange(id)
# A tibble: 7 x 4
id age income group
<dbl> <dbl> <dbl> <chr>
1 1 10 35 case
2 2 20 72 case
3 3 44 11 case
4 222 11 36 control
5 333 8 33 control
6 777 21 70 control
7 1000 18 70 control
I'm trying to pull values from columns based on the values in a vector. I'm not sure I have the right words to describe the problem, but the code should help.
This feels related to coalesce, but maybe not?
library(tidyverse)
# Starting table
dat <-
tibble(
A = 1:10,
B = 31:40,
C = 101:110,
value = c("A", "C", "B", "A", "B", "C", "C", "B", "A", "A")
)
I want:
dat %>%
mutate(
output = c(1, 102, 33, 4, 35, 106, 107, 38, 9, 10)
)
I could do
dat %>%
mutate(
output =
case_when(value == "A" ~ A,
value == "B" ~ B,
value == "C" ~ C)
)
but my real application has many values and I want to take advantage of value having the matching info
Is there a function that does:
dat %>%
mutate(output = grab_the_right_column(value))
Thanks!
The rowwise approach would be less efficient, but it is compact within the tidyverse approaches to get the column value based on the column name for each row.
library(dplyr)
dat %>%
rowwise %>%
mutate(output = get(value)) %>%
ungroup
-output
# A tibble: 10 x 5
# A B C value output
# <int> <int> <int> <chr> <int>
# 1 1 31 101 A 1
# 2 2 32 102 C 102
# 3 3 33 103 B 33
# 4 4 34 104 A 4
# 5 5 35 105 B 35
# 6 6 36 106 C 106
# 7 7 37 107 C 107
# 8 8 38 108 B 38
# 9 9 39 109 A 9
#10 10 40 110 A 10
These types of problems are handled more efficiently with a row/column indexing approach from base R. Build a two-column matrix holding the row sequence and, for each row, the index of the column whose name matches the 'value' column, then use it to extract the elements
dat$output <- as.data.frame(dat)[,1:3][cbind(seq_len(nrow(dat)), match(dat$value, names(dat)[1:3]))]
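Spelled out step by step, the matrix indexing above might look as follows. This is a sketch on a plain data.frame copy of the question's data; cols and idx are names introduced here.

```r
dat <- data.frame(
  A = 1:10,
  B = 31:40,
  C = 101:110,
  value = c("A", "C", "B", "A", "B", "C", "C", "B", "A", "A")
)

cols <- c("A", "B", "C")

# Two-column index matrix: column 1 is the row number,
# column 2 is the position of the wanted column within cols
idx <- cbind(seq_len(nrow(dat)), match(dat$value, cols))

# Indexing a matrix with a two-column matrix picks one cell per row of idx
dat$output <- as.matrix(dat[cols])[idx]
dat
```

A value that does not name any of the columns in cols would give NA from match() and therefore NA in output.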
You can also use purrr and pmap():
library(dplyr)
library(purrr)
dat%>%mutate(output=
pmap(., ~{
v1<-c(...)
v1[names(v1)==v1[['value']]]
}
)%>%
as.numeric()%>%
unlist)
# A tibble: 10 x 5
A B C value output
<int> <int> <int> <chr> <dbl>
1 1 31 101 A 1
2 2 32 102 C 102
3 3 33 103 B 33
4 4 34 104 A 4
5 5 35 105 B 35
6 6 36 106 C 106
7 7 37 107 C 107
8 8 38 108 B 38
9 9 39 109 A 9
10 10 40 110 A 10
Hello fellow Overflowers,
the goal is to run several data manipulation steps on a fairly big dataset. In a first step, certain variables, which represent different cases of a certain piece of information, shall be aggregated for each case. There are always 5 variables to aggregate.
Right now, the dataset looks like this:
a1 a2 a3 a4 a5 b1 b2 b3 b4 b5 ... xyz5 A B C
case1 3 4 7 9 6 21 13 4 1 7 8
case2 9 12 8 17 25 31 7 2 7 6
case3 5 3 11 10 32 19 13 5 1 6 8
...
It should somehow look like this
mean-a mean-b ...mean-xyz A B C
case1 5,8 17 6,4 1 7 8
case2 9,6 24,3 8,3 2 7 6
case3 7,25 21,3 7 1 6 8
...
I'm not sure whether building a function or using the across function from the dplyr package is the right way to do it, since there are about 2000 variables which need to be aggregated.
Any help will be greatly appreciated.
Thanks a lot in advance!
You can also use the following solution:
library(dplyr)
library(stringr)
library(purrr)
# First we extract the unique letter values of column names
letters <- unique(str_remove(names(df), "\\d"))
[1] "a" "b"
# Then we iterate over each unique values and extract the columns that contain that unique letter
letters %>%
map(~ df %>%
select(contains(.x)) %>%
rowwise() %>%
mutate("mean_{.x}" := mean(c_across(contains(.x)), na.rm = TRUE))) %>%
bind_cols() %>%
relocate(contains("mean"), .after = last_col())
# A tibble: 3 x 12
# Rowwise:
a1 a2 a3 a4 a5 b1 b2 b3 b4 b5 mean_a mean_b
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 4 7 9 6 21 13 7 8 4 5.8 10.6
2 9 12 8 17 25 31 4 2 2 7 14.2 9.2
3 5 3 11 10 32 19 13 2 2 5 12.2 8.2
Data
df <- tribble(
~a1, ~a2, ~a3, ~a4, ~a5, ~b1, ~b2, ~b3, ~b4, ~b5,
3, 4, 7, 9, 6, 21, 13, 7, 8, 4,
9, 12, 8, 17, 25, 31, 4, 2, 2, 7,
5, 3, 11, 10, 32, 19, 13, 2, 2, 5
)
Example Data:
# toy data
library(data.table)
m <- matrix(1:30, ncol = 10)
colnames(m) <- c(paste0('a', 1:5), paste0('b', 1:5))
d <- data.table(m)
d
a1 a2 a3 a4 a5 b1 b2 b3 b4 b5
1: 1 4 7 10 13 16 19 22 25 28
2: 2 5 8 11 14 17 20 23 26 29
3: 3 6 9 12 15 18 21 24 27 30
Determine Groups:
You can determine first the groups you want to aggregate.
groups <- split(colnames(d), gsub("\\d", "", colnames(d)))
groups
$a
[1] "a1" "a2" "a3" "a4" "a5"
$b
[1] "b1" "b2" "b3" "b4" "b5"
Aggregate
Afterwards you simply calculate the mean of each group.
> d[,lapply(groups, function(i) {rowMeans(d[, i, with = FALSE])})]
a b
1: 7 22
2: 8 23
3: 9 24
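The question also has columns such as A, B and C that should be carried over untouched. A dependency-free sketch of the same two steps that keeps such columns is shown below. The extra column A and the names value_cols, means and result are assumptions for illustration, and the pattern assumes lower-case prefixes followed by digits.

```r
m <- matrix(1:30, ncol = 10)
colnames(m) <- c(paste0("a", 1:5), paste0("b", 1:5))
d <- as.data.frame(m)
d$A <- c(1, 2, 1)   # hypothetical extra column that should be kept as-is

# Columns to aggregate: lower-case prefix followed by digits only
value_cols <- grep("^[a-z]+[0-9]+$", names(d), value = TRUE)

# Group the column names by their prefix (trailing digits removed)
groups <- split(value_cols, sub("[0-9]+$", "", value_cols))

# One row mean per prefix, then append the untouched columns
means <- as.data.frame(lapply(groups, function(cols) rowMeans(d[cols], na.rm = TRUE)))
names(means) <- paste0("mean_", names(groups))
result <- cbind(means, d[setdiff(names(d), value_cols)])
result
```

Stripping all trailing digits with "[0-9]+$" rather than a single "\\d" keeps the grouping correct if a prefix ever has ten or more variables.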
Description of Data: Dataset contains information regarding users about their age, gender and membership they are holding.
Goal: Create a new column to identify the group/label for each user based on pre-defined conditions.
Age conditions: multiple age brackets :
18 <= age <= 24, 25 <= age <= 30, 31 <= age <= 41, 42 <= age <= 60, age >= 61
Gender: M/F
Membership: A,B,C,I
I created sample data frame to try out creation of new column to identify the group/label
df = data.frame(userid = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10,11, 12),
age = c(18, 61, 23, 35, 30, 25, 55, 53, 45, 41, 21, NA),
gender = c('F', 'M', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'F', '<NA>', 'M'),
membership = c('A', 'B', 'A', 'C', 'C', 'B', 'A', 'A', 'I', 'I', 'A', '<NA>'))
userid age gender membership
1 1 18 F A
2 2 61 M B
3 3 23 F A
4 4 35 F C
5 5 30 M C
6 6 25 M B
7 7 55 M A
8 8 53 M A
9 9 45 M I
10 10 41 F I
11 11 21 <NA> A
12 12 NA M <NA>
Based on above data there exist 4 * 2 * 5 options (combinations)
Final outcome:
userid age gender membership GroupID
1 1 16 F A 1
2 2 61 M B 40
3 3 23 F A 1
4 4 35 F C 4
5 5 30 M C 5
6 6 25 M B 3
7 7 55 M A 32
8 8 53 M A 32
9 9 45 M I 34
10 10 41 F I 35
userid age gender membership GroupID
1 1 18 F A 1
2 2 61 M B 40
3 3 23 F A 1
4 4 35 F C 4
5 5 30 M C 5
6 6 25 M B 3
7 7 55 M A 32
8 8 53 M A 32
9 9 45 M I 34
10 10 41 F I 35
11 11 21 <NA> A 43 (assuming it will auto-detect the combo)
12 12 NA M <NA> 46
I believe my calculation of the number of combinations is correct. If so, how can I use dplyr or any other option to get the above data frame?
Use multiple if conditions to confirm all the options?
In dplyr is there a way to actually provide conditions for each column to set the grouping conditions:
df %>% group_by(age, gender, membership)
Two options,
One, more automated;
# install.packages("tidyverse", dependencies = TRUE)
library(tidyverse)
df %>% mutate(ageCat = cut(age, breaks = c(-Inf, 24, 30, 41, 60, Inf))) %>%
mutate(GroupID = group_indices(., ageCat, gender, membership)) %>% select(-ageCat)
#> userid age gender membership GroupID
#> 1 1 18 F A 2
#> 2 2 61 M B 9
#> 3 3 23 F A 2
#> 4 4 35 F C 5
#> 5 5 30 M C 4
#> 6 6 25 M B 3
#> 7 7 55 M A 7
#> 8 8 53 M A 7
#> 9 9 45 M I 8
#> 10 10 41 F I 6
#> 11 11 21 <NA> A 1
#> 12 12 NA M <NA> 10
Two, more manual;
Here I make an illustration of a solution with category 1 and 4, you have to code the rest yourself.
df %>% mutate(GroupID =
ifelse(age >= 18 & age <= 24 & gender == 'F' & membership == "A", 1,
ifelse(age >= 31 & age <= 41 & gender == 'F' & membership == "C", 4, NA)
))
#> userid age gender membership GroupID
#> 1 1 18 F A 1
#> 2 2 61 M B NA
#> 3 3 23 F A 1
#> 4 4 35 F C 4
#> 5 5 30 M C NA
#> 6 6 25 M B NA
#> 7 7 55 M A NA
#> 8 8 53 M A NA
#> 9 9 45 M I NA
#> 10 10 41 F I NA
#> 11 11 21 <NA> A NA
#> 12 12 NA M <NA> NA
the data structure in case others feel like giving it a go,
You can try this:
library(data.table)
setDT(df)
# age bracket 1-5; rows with a missing age get NA
df[, agegrp := ifelse(age >= 18 & age <= 24, 1,
ifelse(age >= 25 & age <= 30, 2,
ifelse(age >= 31 & age <= 41, 3,
ifelse(age >= 42 & age <= 60, 4, 5))))]
# one running group id per (age bracket, gender, membership) combination
df[, group := .GRP, by = .(agegrp, gender, membership)]
If you want to use base R only, you could do something like this:
# 1
allcombos <- expand.grid(c("M", "F"), c("A", "B", "C", "I"), 1:5)
allgroups <- do.call(paste0, allcombos) # 40 unique combinations
# 2
agegroups <- cut(df$age,
breaks = c(17, 24, 30, 41, 60, 99),
labels = c(1, 2, 3, 4, 5))
# 3
df$groupid <- paste0(df$gender, df$membership, agegroups)
df$groupid <- factor(df$groupid, levels=allgroups, labels=1:length(allgroups))
expand.grid gives you a data.frame with three columns where every row represents a unique combination of the three arguments provided. As you said, these are 40 combinations. The second line combines every row of the data frame in a single string, like "MA1", "FA1", "MB1", etc.
Then we use cut to assign each age to its age group, labelled 1 to 5.
We create a column in df that contains the three character combination of the gender, membership and age group which is then converted to a factor, according to all possible combinations we found in allgroups.
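Since cut() builds right-closed intervals by default, it is worth checking the bracket edges once. Here is a small sketch with made-up boundary ages. The breaks are chosen so that 60 is the last age in group 4 and 61 the first in group 5, matching the brackets in the question; using Inf as the upper limit is an assumption of this sketch.

```r
# Ages sitting exactly on the bracket boundaries, plus one missing value
ages <- c(18, 24, 25, 30, 31, 41, 42, 60, 61, NA)

# Intervals are (17,24], (24,30], (30,41], (41,60], (60,Inf]
agegroups <- cut(ages,
                 breaks = c(17, 24, 30, 41, 60, Inf),
                 labels = 1:5)

data.frame(age = ages, agegroup = agegroups)
```

Ages of 17 or below fall outside all intervals and come back as NA, just like a missing age.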