Hi, I'm currently using a large observational dataset to estimate the average effect of a treatment. To balance the treatment and control groups, I matched individuals on a series of variables using the full_join command from dplyr.
matched_sample <- full_join(case, control, by = matched_variables)
The matched sample ended up with many rows because some individuals were matched more than once. I documented the number of matches found for each individual. Here I present a simpler version:
case_id <- c("A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C", "C", "D", "D", "E", "F", "F")
num_controls_matched <- c(7, 7, 7, 7, 7, 7, 7, 3, 3, 3, 5, 5, 5, 5, 5, 2, 2, 1, 2, 2)
control_id <- c("a" , "b", "c", "d", "e", "f", "g", "a", "b", "e", "a", "b", "e", "f", "h", "a", "e", "a", "b", "e")
num_cases_matched <- c(5, 4, 1, 1, 5, 2, 1, 5, 4, 5, 5, 4, 5, 2, 1, 5, 5, 5, 4, 5)
case_id num_controls_matched control_id num_cases_matched
1 A 7 a 5
2 A 7 b 4
3 A 7 c 1
4 A 7 d 1
5 A 7 e 5
6 A 7 f 2
7 A 7 g 1
8 B 3 a 5
9 B 3 b 4
10 B 3 e 5
11 C 5 a 5
12 C 5 b 4
13 C 5 e 5
14 C 5 f 2
15 C 5 h 1
16 D 2 a 5
17 D 2 e 5
18 E 1 a 5
19 F 2 b 4
20 F 2 e 5
where case_id and control_id are IDs of those from the treatment and the control groups, num_controls_matched is the number of matches found for the treated individuals, and num_cases_matched is the number of matches found for individuals in the control group.
I would like to keep as many treated individuals in the sample as possible. I would also like to prioritise the matches for the "less popular" individuals. For example, the treated individual E was only matched to 1 control, so the match E-a should be prioritised. Then, both D and F have 2 matches. Because b has only 4 matches whilst a and e both have 5 matches, F-b should be prioritised. Therefore, D can only be matched with e. The next one should be B because it has 3 matches. However, since a, b and e have already been matched with D, E and F, B has no match (NA). C is matched with h because h has only 1 match. A can be matched with c, d, or g.
I would like to construct a data frame that indicates the final 1:1 matches:
case_id control_id
A g
B NA
C h
D e
E a
F b
The original dataset includes more than 2,000 individuals, and some individuals have more than 30 matches. Due to the characteristics of some matching variables, propensity score matching is not what I am looking for. I would be really grateful for your help with this.
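library(dplyr)

fun <- function(df, i = 1){
  # pairs in which either side has exactly i candidate matches
  a <- df %>%
    filter(num_controls_matched == i | num_cases_matched == i)
  # remaining pairs whose case and control do not appear in a
  b <- df %>%
    filter(!(case_id %in% a$case_id | control_id %in% a$control_id))
  # if some case still has several candidates in b, retry with i + 1
  if (any(table(b$case_id) > 1)) fun(df, i + 1)
  else rbind(a, b)[c('case_id', 'control_id')]
}
fun(df)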
case_id control_id
1 A a
2 B b
3 C c
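The priority rule described in the question can also be sketched as a single greedy pass in base R. This is only a minimal sketch, not a definitive method: greedy_match is a hypothetical helper name, ties are assumed to be broken by row order, and a greedy pass is not guaranteed to keep the largest possible number of treated individuals.

```r
# Example data from the question
df <- data.frame(
  case_id = c("A", "A", "A", "A", "A", "A", "A", "B", "B", "B",
              "C", "C", "C", "C", "C", "D", "D", "E", "F", "F"),
  num_controls_matched = c(7, 7, 7, 7, 7, 7, 7, 3, 3, 3,
                           5, 5, 5, 5, 5, 2, 2, 1, 2, 2),
  control_id = c("a", "b", "c", "d", "e", "f", "g", "a", "b", "e",
                 "a", "b", "e", "f", "h", "a", "e", "a", "b", "e"),
  num_cases_matched = c(5, 4, 1, 1, 5, 2, 1, 5, 4, 5,
                        5, 4, 5, 2, 1, 5, 5, 5, 4, 5),
  stringsAsFactors = FALSE
)

# Hypothetical helper (assumption, not from the original post):
# walk through the candidate pairs from least to most "popular"
# and accept a pair only if both sides are still unmatched.
greedy_match <- function(df) {
  # least popular cases first, then least popular controls
  ord <- df[order(df$num_controls_matched, df$num_cases_matched), ]
  used_case <- character(0)
  used_control <- character(0)
  keep <- logical(nrow(ord))
  for (i in seq_len(nrow(ord))) {
    free <- !(ord$case_id[i] %in% used_case) &&
      !(ord$control_id[i] %in% used_control)
    if (free) {
      keep[i] <- TRUE
      used_case <- c(used_case, ord$case_id[i])
      used_control <- c(used_control, ord$control_id[i])
    }
  }
  matched <- ord[keep, c("case_id", "control_id")]
  # left join so that cases without a free control show up as NA
  merge(data.frame(case_id = unique(df$case_id), stringsAsFactors = FALSE),
        matched, by = "case_id", all.x = TRUE)
}

res <- greedy_match(df)
res
```

On the example data this pairs E with a, F with b, D with e and C with h, leaves B unmatched (NA), and gives A one of its tied candidates (c, d or g).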
I can't think how to do this in a tidy fashion.
I have a table as follows:
tibble(
Min = c(1, 5, 12, 13, 19),
Max = c(3, 11, 12, 14, 19),
Value = c("a", "bb", "c", "d", "e" )
)
and I want to generate another table from it as shown below
tibble(
Row = c(1:3, 5:11, 12:12, 13:14, 19:19),
Value = c( rep("a", 3), rep("bb", 7), "c", "d", "d", "e" )
)
Grateful for any suggestions folk might have. The only 'solutions' which come to mind are a bit cumbersome.
1) If DF is the input then:
library(dplyr)
DF %>%
group_by(Value) %>%
group_modify(~ tibble(Row = seq(.$Min, .$Max))) %>%
ungroup
giving:
# A tibble: 14 x 2
Value Row
<chr> <int>
1 a 1
2 a 2
3 a 3
4 bb 5
5 bb 6
6 bb 7
7 bb 8
8 bb 9
9 bb 10
10 bb 11
11 c 12
12 d 13
13 d 14
14 e 19
2) This one creates a list column L containing tibbles and then unnests it. Duplicate Value elements are ok with this one.
library(dplyr)
library(tidyr)
DF %>%
rowwise %>%
summarize(L = list(tibble(Value, Row = seq(Min, Max)))) %>%
ungroup %>%
unnest(L)
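For comparison, the same expansion can be sketched in base R without any packages. This is a minimal sketch assuming Min <= Max in every row; the name out is only illustrative.

```r
# Input table from the question, as a plain data frame
DF <- data.frame(
  Min = c(1, 5, 12, 13, 19),
  Max = c(3, 11, 12, 14, 19),
  Value = c("a", "bb", "c", "d", "e"),
  stringsAsFactors = FALSE
)

out <- data.frame(
  # one sequence Min..Max per input row, flattened into a single vector
  Row = unlist(Map(seq, DF$Min, DF$Max)),
  # repeat each Value once per element of its sequence
  Value = rep(DF$Value, times = DF$Max - DF$Min + 1),
  stringsAsFactors = FALSE
)
out
```

Like option 2, this works row by row, so duplicate Value elements are not a problem.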
I am struggling to write the right logic to filter two columns based only on the condition in one column. I have multiple ids and if an id appears in 2020, I want all the data for the other years that id was measured to come along.
As an example, if a group contains the number 3, I want all the values in that group. We should end up with a dataframe with all the b and d rows.
df4 <- data.frame(group = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b",
"c", "c", "c", "c", "c", "d", "d", "d", "d", "d"),
pop = c(1, 2, 2, 4, 5, 1, 2, 3, 4, 5, 1, 2, 1, 4, 5, 1, 2, 3, 4, 5),
value = c(1,2,3,2.5,2,2,3,4,3.5,3,3,2,1,2,2.5,0.5,1.5,6,2,1.5))
threes <- df4 %>%
filter(pop == 3 |&ifelse????
A bit slower than the other answers here (more steps involved), but for me a bit clearer:
df4 %>%
filter(pop == 3) %>%
distinct(group) %>%
pull(group) -> groups
df4 %>%
filter(group %in% groups)
or if you want to combine the two steps:
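df4 %>%
  # the inner pipeline must be wrapped in parentheses: %in% and %>% have
  # the same precedence, so without them group %in% df4 is evaluated first
  filter(group %in% (df4 %>%
    filter(pop == 3) %>%
    distinct(group) %>%
    pull(group)))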
You can do:
df4[df4$group %in% df4$group[df4$pop == 3],]
#> group pop value
#> 6 b 1 2.0
#> 7 b 2 3.0
#> 8 b 3 4.0
#> 9 b 4 3.5
#> 10 b 5 3.0
#> 16 d 1 0.5
#> 17 d 2 1.5
#> 18 d 3 6.0
#> 19 d 4 2.0
#> 20 d 5 1.5
You can do this by combining dplyr's group_by(), filter() and any(). group_by() makes the following operation run separately for each level of the grouping variable, and any() returns TRUE if at least one value in the group satisfies the condition.
Follow these steps:
First, pipe the data to group_by() to group by your group variable.
Then pipe to filter() and use any() to keep the groups in which at least one pop value equals 3.
df4 <- data.frame(group = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b",
"c", "c", "c", "c", "c", "d", "d", "d", "d", "d"),
pop = c(1, 2, 2, 4, 5, 1, 2, 3, 4, 5, 1, 2, 1, 4, 5, 1, 2, 3, 4, 5),
value = c(1,2,3,2.5,2,2,3,4,3.5,3,3,2,1,2,2.5,0.5,1.5,6,2,1.5))
# load the library
library(dplyr)
threes <- df4 %>%
group_by(group) %>%
filter(any(pop == 3))
# print the result
threes
Output:
threes
# A tibble: 10 x 3
# Groups: group [2]
group pop value
<chr> <dbl> <dbl>
1 b 1 2
2 b 2 3
3 b 3 4
4 b 4 3.5
5 b 5 3
6 d 1 0.5
7 d 2 1.5
8 d 3 6
9 d 4 2
10 d 5 1.5
An easy base R option is using subset + ave
subset(
df4,
ave(pop == 3, group, FUN = any)
)
which gives
group pop value
6 b 1 2.0
7 b 2 3.0
8 b 3 4.0
9 b 4 3.5
10 b 5 3.0
16 d 1 0.5
17 d 2 1.5
18 d 3 6.0
19 d 4 2.0
20 d 5 1.5
Use dplyr:
df4 %>%
  group_by(group) %>%
  filter(any(pop == 3))
I am trying to add two rows to the data frame.
For the first row, MODEL should be X, total_value should be the sum of total_value over the rows where MODEL is A or C, and total_frequency should be the sum of total_frequency over those same rows.
For the second row, MODEL should be Z, total_value should be the sum of total_value over the rows where MODEL is D, E or F, and total_frequency should be the sum of total_frequency over those same rows.
I am stuck, as I do not know how to select specific values of MODEL and then sum these two other columns.
Here is my data
data.frame(MODEL=c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J"), total_value= c(62, 54, 78, 38, 16, 75, 39, 13, 58, 37),
total_frequency = c(78, 83, 24, 13, 22, 52, 16, 16, 20, 72))
You can try with dplyr: calculate the new rows, then bind them to the original data df:
library(dplyr)
first <- df %>%
# select the models you need
filter(MODEL %in% c("A","C")) %>%
# call them x
mutate(MODEL = 'X') %>%
# grouping
group_by(MODEL) %>%
# calculate the sums
summarise_all(sum)
# same with the second
second <- df %>%
filter(MODEL %in% c("D","F","E")) %>%
mutate(MODEL = 'Z') %>%
group_by(MODEL) %>% summarise_all(sum)
# put together
rbind(df, first, second)
# A tibble: 12 x 3
MODEL total_value total_frequency
1 A 62 78
2 B 54 83
3 C 78 24
4 D 38 13
5 E 16 22
6 F 75 52
7 G 39 16
8 H 13 16
9 I 58 20
10 J 37 72
11 X 140 102
12 Z 129 87
The following code is a straightforward solution to the problem.
i1 <- df1$MODEL %in% c("A", "C")
total_value <- sum(df1$total_value[i1])
total_frequency <- sum(df1$total_frequency[i1])
df1 <- rbind(df1, data.frame(MODEL = "X", total_value, total_frequency))
i2 <- df1$MODEL %in% c("D", "E", "F")
total_value <- sum(df1$total_value[i2])
total_frequency <- sum(df1$total_frequency[i2])
df1 <- rbind(df1, data.frame(MODEL = "Z", total_value, total_frequency))
df1
# MODEL total_value total_frequency
#1 A 62 78
#2 B 54 83
#3 C 78 24
#4 D 38 13
#5 E 16 22
#6 F 75 52
#7 G 39 16
#8 H 13 16
#9 I 58 20
#10 J 37 72
#11 X 140 102
#12 Z 129 87
It is also possible to write a function to avoid repeating the same code.
fun <- function(X, M, vals){
i1 <- X$MODEL %in% vals
total_value <- sum(X$total_value[i1])
total_frequency <- sum(X$total_frequency[i1])
rbind(X, data.frame(MODEL = M, total_value, total_frequency))
}
df1 <- fun(df1, M = "X", vals = c("A", "C"))
df1 <- fun(df1, M = "Z", vals = c("D", "E", "F"))
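If more summary rows are needed, the same idea can be driven by a named list. The following is only a sketch: new_rows, totals and res are hypothetical names, and all sums are computed from the original rows before anything is appended.

```r
# Data from the question
df1 <- data.frame(
  MODEL = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J"),
  total_value = c(62, 54, 78, 38, 16, 75, 39, 13, 58, 37),
  total_frequency = c(78, 83, 24, 13, 22, 52, 16, 16, 20, 72),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: name of the new row -> models to be summed
new_rows <- list(X = c("A", "C"), Z = c("D", "E", "F"))

# Build one summary row per entry of the lookup
totals <- do.call(rbind, lapply(names(new_rows), function(m) {
  i <- df1$MODEL %in% new_rows[[m]]
  data.frame(
    MODEL = m,
    total_value = sum(df1$total_value[i]),
    total_frequency = sum(df1$total_frequency[i]),
    stringsAsFactors = FALSE
  )
}))

# Append the summary rows to the original data
res <- rbind(df1, totals)
res
```

Adding another summary row then only requires another entry in new_rows.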
I am trying to calculate the median (though it could be replaced by a similar metric) by group for multiple columns, based on subsets defined by other columns. This is a direct follow-on question from this previous post of mine. I have attempted to incorporate calculating the median via aggregate into the Map(function(x,y) dosomething, x, y) solution kindly provided by @Frank, but that didn't work. Let me illustrate:
Calculate median for A and B by groups GRP1 and GRP2
df <- data.frame(GRP1 = c("A","A","A","A","A","A","B","B","B","B","B","B"), GRP2 = c("A","A","A","B","B","B","A","A","A","B","B","B"), A = c(0,4,6,7,0,1,9,0,0,8,3,4), B = c(6,0,4,8,6,7,0,9,9,7,3,0))
med <- aggregate(.~GRP1+GRP2,df,FUN=median)
Simple. Now add columns that define which rows are used for each median: rows with NA should be dropped. Column a defines which rows are used for the median of column A, and column b does the same for column B:
a <- c(1,4,7,3,NA,3,7,NA,NA,4,8,1)
b <- c(5,NA,7,9,5,6,NA,8,1,7,2,9)
df1 <- cbind(df,a,b)
As mentioned above, I have tried combining Map and aggregate, but that didn't work. I assume that Map doesn't know what to do with GRP1 and GRP2.
med1 <- Map(function(x,y) aggregate(.~GRP1+GRP2,df1[!is.na(y)],FUN=median), x=df1[,3:4], y=df1[, 5:6])
This is the result I'm looking for:
GRP1 GRP2 A B
1 A A 4 5
2 B A 9 9
3 A B 4 7
4 B B 4 3
Any help will be much appreciated!
Using data.table
library(data.table)
setDT(df1)
df1[, .(A = median(A[!is.na(a)]), B = median(B[!is.na(b)])), by = .(GRP1, GRP2)]
GRP1 GRP2 A B
1: A A 4 5
2: A B 4 7
3: B A 9 9
4: B B 4 3
Same logic in dplyr
library(dplyr)
df1 %>%
group_by(GRP1, GRP2) %>%
summarise(A = median(A[!is.na(a)]), B = median(B[!is.na(b)]))
The original df1:
df1 <- data.frame(
GRP1 = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
GRP2 = c("A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B"),
A = c(0, 4, 6, 7, 0, 1, 9, 0, 0, 8, 3, 4),
B = c(6, 0, 4, 8, 6, 7, 0, 9, 9, 7, 3, 0),
a = c(1, 4, 7, 3, NA, 3, 7, NA, NA, 4, 8, 1),
b = c(5, NA, 7, 9, 5, 6, NA, 8, 1, 7, 2, 9)
)
With dplyr:
library(dplyr)
df1 %>%
mutate(A = ifelse(is.na(a), NA, A),
B = ifelse(is.na(b), NA, B)) %>%
# I use this to put as NA the values we don't want to include
group_by(GRP1, GRP2) %>%
summarise(A = median(A, na.rm = TRUE),
B = median(B, na.rm = TRUE))
# A tibble: 4 x 4
# Groups: GRP1 [?]
GRP1 GRP2 A B
<fct> <fct> <dbl> <dbl>
1 A A 4 5
2 A B 4 7
3 B A 9 9
4 B B 4 3
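A base R variant that stays close to the aggregate call from the question might look like the sketch below. It assumes the same masking idea as above (set A and B to NA wherever a or b is NA); df2 and med1 are illustrative names. na.action = na.pass is needed because the formula method of aggregate otherwise drops every row containing an NA before the medians are computed.

```r
df1 <- data.frame(
  GRP1 = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
  GRP2 = c("A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B"),
  A = c(0, 4, 6, 7, 0, 1, 9, 0, 0, 8, 3, 4),
  B = c(6, 0, 4, 8, 6, 7, 0, 9, 9, 7, 3, 0),
  a = c(1, 4, 7, 3, NA, 3, 7, NA, NA, 4, 8, 1),
  b = c(5, NA, 7, 9, 5, 6, NA, 8, 1, 7, 2, 9)
)

# Mask the values that must not enter the median
df2 <- df1
df2$A[is.na(df2$a)] <- NA
df2$B[is.na(df2$b)] <- NA

# Keep rows with NA (na.pass) and let median() skip them (na.rm = TRUE)
med1 <- aggregate(cbind(A, B) ~ GRP1 + GRP2, data = df2,
                  FUN = median, na.rm = TRUE, na.action = na.pass)
med1
```

The rows come out in the order A/A, B/A, A/B, B/B, which is the order used in the expected result in the question.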