Hi, I'm currently using a large observational dataset to estimate the average effect of a treatment. To balance the treatment and control groups, I matched individuals on a set of variables using the full_join command.
matched_sample <- full_join(case, control, by = matched_variables)
The matched sample ended up with many rows because some individuals were matched more than once. I documented the number of matches found for each individual. Here is a simplified version:
case_id <- c("A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C", "C", "D", "D", "E", "F", "F")
num_controls_matched <- c(7, 7, 7, 7, 7, 7, 7, 3, 3, 3, 5, 5, 5, 5, 5, 2, 2, 1, 2, 2)
control_id <- c("a", "b", "c", "d", "e", "f", "g", "a", "b", "e", "a", "b", "e", "f", "h", "a", "e", "a", "b", "e")
num_cases_matched <- c(5, 4, 1, 1, 5, 2, 1, 5, 4, 5, 5, 4, 5, 2, 1, 5, 5, 5, 4, 5)
df <- data.frame(case_id, num_controls_matched, control_id, num_cases_matched)
case_id num_controls_matched control_id num_cases_matched
1 A 7 a 5
2 A 7 b 4
3 A 7 c 1
4 A 7 d 1
5 A 7 e 5
6 A 7 f 2
7 A 7 g 1
8 B 3 a 5
9 B 3 b 4
10 B 3 e 5
11 C 5 a 5
12 C 5 b 4
13 C 5 e 5
14 C 5 f 2
15 C 5 h 1
16 D 2 a 5
17 D 2 e 5
18 E 1 a 5
19 F 2 b 4
20 F 2 e 5
where case_id and control_id are the IDs of individuals in the treatment and control groups respectively, num_controls_matched is the number of matches found for each treated individual, and num_cases_matched is the number of matches found for each control.
I would like to keep as many treated individuals in the sample as possible, and to prioritise matches for the "less popular" individuals. For example, the treated individual E was matched to only 1 control, so the match E-a should be made first. Next, D and F both have 2 matches; because b has only 4 matches while a and e both have 5, F-b should be prioritised, which leaves D to be matched with e. The next case is B, with 3 matches; however, since a, b and e have already been assigned to E, F and D, B has no match (NA). C is matched with h because h has only 1 match. Finally, A can be matched with c, d, or g.
I would like to construct a data frame indicating the final 1:1 matches:
case_id control_id
A g
B NA
C h
D e
E a
F b
The original dataset includes more than 2,000 individuals, and some individuals have more than 30 matches. Due to the characteristics of some matching variables, propensity score matching is not what I am looking for. I will be really grateful for your help on this.
The matches can be built greedily: take the case with the fewest candidate controls first, and give it its least "popular" control that is still free. A sketch of that idea (it reproduces the example above; ties are broken by row order, which is why A ends up with c rather than g):
library(dplyr)
fun <- function(df) {
  # process cases from fewest to most candidate controls
  ord <- df %>% distinct(case_id, num_controls_matched) %>%
    arrange(num_controls_matched) %>% pull(case_id)
  used <- character(0)
  out <- lapply(ord, function(cs) {
    # controls still free for this case, least popular first
    cand <- df %>% filter(case_id == cs, !control_id %in% used) %>%
      arrange(num_cases_matched)
    ctrl <- if (nrow(cand) > 0) cand$control_id[1] else NA_character_
    used <<- c(used, ctrl)
    data.frame(case_id = cs, control_id = ctrl)
  })
  bind_rows(out) %>% arrange(case_id)
}
fun(df)
  case_id control_id
1       A          c
2       B       <NA>
3       C          h
4       D          e
5       E          a
6       F          b
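With 2,000+ individuals a greedy pass can leave avoidable NAs, since maximising the number of matched cases is really a maximum bipartite matching problem. Here is a sketch using the igraph package (it assumes case and control IDs never share a label, and it does not encode the popularity tie-breaking):
library(igraph)
# one vertex per case and per control; type distinguishes the two sides
g <- graph_from_data_frame(df[, c("case_id", "control_id")], directed = FALSE)
V(g)$type <- V(g)$name %in% df$control_id
m <- max_bipartite_match(g)
# matched control for each case (NA if none)
m$matching[unique(df$case_id)]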
I have the following data frame, describing the conditions each patient has (each patient can have more than one):
df <- structure(list(patient = c(1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6,
6, 7, 7, 8, 8, 9, 9, 10), condition = c("A", "A", "B", "B", "D",
"C", "A", "C", "C", "B", "D", "B", "A", "A", "C", "B", "C", "D",
"C", "D")), row.names = c(NA, -20L), class = c("tbl_df", "tbl",
"data.frame"))
I would like to create a "confusion matrix", which in this case will be a 4x4 matrix where the AxA cell has the value 5 (5 patients have condition A), the AxB cell has the value 2 (two patients have both A and B), and so on.
How can I achieve this?
You can join the table to itself by patient and tabulate the two resulting condition columns.
library(dplyr)
df2 <- inner_join(df, df, by = "patient")
table(df2$condition.x, df2$condition.y)
A B C D
A 5 2 2 1
B 2 5 3 2
C 2 3 6 2
D 1 2 2 4
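For completeness, the same matrix can be computed in one line of base R: the cross-product of the patient-by-condition incidence table, t(M) %*% M, has the per-condition patient counts on the diagonal and the shared-patient counts off the diagonal.
# 0/1 incidence of patients x conditions, then co-occurrence counts
crossprod(table(df$patient, df$condition))
This reproduces the 4x4 matrix shown above.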
Here is a base R answer using outer -
count_patient <- function(x, y) {
  length(intersect(df$patient[df$condition == x],
                   df$patient[df$condition == y]))
}
vec <- sort(unique(df$condition))
res <- outer(vec, vec, Vectorize(count_patient))
dimnames(res) <- list(vec, vec)
res
# A B C D
#A 5 2 2 1
#B 2 5 3 2
#C 2 3 6 2
#D 1 2 2 4
I am struggling to write the right logic to filter on two columns based on a condition in only one of them. I have multiple ids, and if an id appears in 2020 I want all the data from the other years that id was measured to come along as well.
As an example, if a group contains the number 3, I want all the values in that group. We should end up with a dataframe with all the b and d rows.
df4 <- data.frame(group = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b",
                            "c", "c", "c", "c", "c", "d", "d", "d", "d", "d"),
                  pop = c(1, 2, 2, 4, 5, 1, 2, 3, 4, 5, 1, 2, 1, 4, 5, 1, 2, 3, 4, 5),
                  value = c(1, 2, 3, 2.5, 2, 2, 3, 4, 3.5, 3, 3, 2, 1, 2, 2.5, 0.5, 1.5, 6, 2, 1.5))
threes <- df4 %>%
filter(pop == 3 |&ifelse????
A bit slower than the other answers here (more steps involved), but for me a bit clearer:
df4 %>%
  filter(pop == 3) %>%
  distinct(group) %>%
  pull(group) -> groups

df4 %>%
  filter(group %in% groups)
or if you want to combine the two steps:
df4 %>%
  filter(group %in% (df4 %>%
                       filter(pop == 3) %>%
                       distinct(group) %>%
                       pull(group)))
(Note the parentheses around the inner pipeline: without them, %in% grabs df4 before the pipe runs and the filter errors.)
You can do:
df4[df4$group %in% df4$group[df4$pop == 3],]
#> group pop value
#> 6 b 1 2.0
#> 7 b 2 3.0
#> 8 b 3 4.0
#> 9 b 4 3.5
#> 10 b 5 3.0
#> 16 d 1 0.5
#> 17 d 2 1.5
#> 18 d 3 6.0
#> 19 d 4 2.0
#> 20 d 5 1.5
You can do it this way using dplyr's group_by(), filter() and any() functions combined. any() returns TRUE if the condition holds for any row, and group_by() makes the filter operate within each subgroup of the grouping variable.
Follow these steps:
First pipe the data to group_by() to group by your group variable.
Then pipe to filter() to keep the groups where any pop equals 3, using the any() function.
df4 <- data.frame(group = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b",
                            "c", "c", "c", "c", "c", "d", "d", "d", "d", "d"),
                  pop = c(1, 2, 2, 4, 5, 1, 2, 3, 4, 5, 1, 2, 1, 4, 5, 1, 2, 3, 4, 5),
                  value = c(1, 2, 3, 2.5, 2, 2, 3, 4, 3.5, 3, 3, 2, 1, 2, 2.5, 0.5, 1.5, 6, 2, 1.5))
# load the library
library(dplyr)
threes <- df4 %>%
  group_by(group) %>%
  filter(any(pop == 3))
# print the result
threes
Output:
# A tibble: 10 x 3
# Groups: group [2]
group pop value
<chr> <dbl> <dbl>
1 b 1 2
2 b 2 3
3 b 3 4
4 b 4 3.5
5 b 5 3
6 d 1 0.5
7 d 2 1.5
8 d 3 6
9 d 4 2
10 d 5 1.5
An easy base R option is using subset + ave
subset(
  df4,
  ave(pop == 3, group, FUN = any)
)
which gives
group pop value
6 b 1 2.0
7 b 2 3.0
8 b 3 4.0
9 b 4 3.5
10 b 5 3.0
16 d 1 0.5
17 d 2 1.5
18 d 3 6.0
19 d 4 2.0
20 d 5 1.5
Use dplyr:
df4 %>% group_by(group) %>% filter(any(pop == 3))
I have a dataset of the following form:
a <- data.frame(X1=c("A", "B", "C", "A", "B", "C"),
                X2=c("B", "C", "C", "A", "A", "B"),
                X3=c("B", "E", "A", "A", "A", "B"),
                X4=c("E", "C", "A", "A", "A", "C"),
                X5=c("A", "C", "C", "A", "B", "B")
)
And I have another set of the following form:
b <- data.frame(col_1=c("ASD", "ASD", "BSD", "BSD"),
                col_2=c(1, 1, 1, 1),
                col_3=c(12, 12, 31, 21),
                col_4=c("A", "B", "B", "A")
)
What I want to do is take the column col_4 from set b and match it row-wise against set a, producing a new column that tells me how many elements of col_4 each row contains. The name of the new column does not matter.
For example, the first and fifth rows of set a contain all the elements of col_4 from set b.
Also, duplicates should not be over-counted. For example, the sixth row of set a has three "B"s, but since col_4 from set b contains only two "B"s, the count should be 2, not 3.
The expected output is of the form:
c <- data.frame(X1=c("A", "B", "C", "A", "B", "C"),
                X2=c("B", "C", "C", "A", "A", "B"),
                X3=c("B", "E", "A", "A", "A", "B"),
                X4=c("E", "C", "A", "A", "A", "C"),
                X5=c("A", "C", "C", "A", "B", "B"),
                found=c(4, 1, 2, 2, 4, 2)
)
We can use vecsets::vintersect, which takes care of duplicates.
Applying it row-wise, we can count how many values b$col_4 and each row of a have in common.
apply(a, 1, function(x) length(vecsets::vintersect(b$col_4, x)))
#[1] 4 1 2 2 4 2
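A quick illustration of why vintersect is used here instead of base intersect, which treats its inputs as sets and drops duplicates (the vintersect output assumes it keeps the minimum multiplicity of each element, which is the behaviour this answer relies on):
intersect(c("B", "B", "B"), c("B", "B"))
# [1] "B"
vecsets::vintersect(c("B", "B", "B"), c("B", "B"))
# [1] "B" "B"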
An option using data.table:
library(data.table)
#convert a into a long format
m <- melt(setDT(a)[, rn:=.I], id.vars="rn", value.name="col_4")
#order by row number and create an index for identical occurrences in col_4
setorder(m, rn, col_4)[, vidx := rowid(col_4), rn]
#create a similar index for b
setDT(b, key="col_4")[, vidx := rowid(col_4)]
#count occurrences and lookup this count into original data
a[b[m, on=.(col_4, vidx), nomatch=0L][, .N, rn], on=.(rn), found := N]
output:
X1 X2 X3 X4 X5 rn found
1: A B B E A 1 4
2: B C E C C 2 1
3: C C A A C 3 2
4: A A A A A 4 2
5: B A A A B 5 4
6: C B B C B 6 2
Another idea for operating on the sets efficiently is to count and compare the occurrences of b$col_4's elements in each row of a:
b1 = c(table(b$col_4))
#b1
#A B
#2 2
a1 = table(factor(as.matrix(a), names(b1)), row(a))
#a1
#
# 1 2 3 4 5 6
# A 2 0 2 5 3 0
# B 2 1 0 0 2 3
Finally, identify the minimum number of occurrences per element (for each row) and sum:
colSums(pmin(a1, b1))
#1 2 3 4 5 6
#4 1 2 2 4 2
For a "data.frame" of larger dimensions and more elements, Matrix::sparseMatrix offers an appropriate alternative:
library(Matrix)
a.fac = factor(as.matrix(a), names(b1))
.i = as.integer(a.fac)
.j = c(row(a))
noNA = !is.na(.i) ## need to remove NAs manually
.i = .i[noNA]
.j = .j[noNA]
a1 = sparseMatrix(i = .i, j = .j, x = 1L, dimnames = list(names(b1), 1:nrow(a)))
a1
#2 x 6 sparse Matrix of class "dgCMatrix"
# 1 2 3 4 5 6
#A 2 . 2 5 3 .
#B 2 1 . . 2 3
colSums(pmin(a1, b1))
#1 2 3 4 5 6
#4 1 2 2 4 2
I have a data table with a number of observations. Each has a non-unique ID and an integer X value, and the observations are grouped by ID. I want the sum of X for each group to be at least 10. To achieve this, each observation is allowed to change its ID, as long as the sum of X for its original group does not drop below 10.
Below is an example of what I mean and a very manual solution:
# sample data
input <- data.table(ID = c("A", "A", "A", "B", "B", "B",
                           "C", "C", "C", "D", "D", "D"),
                    X = c(1, 3, 1, 5, 1, 5,
                          6, 10, 2, 3, 3, 4))
# summarise X by ID
input[, .(X = sum(X)), by = ID]
ID X
A 5
B 11
C 18
D 10
# what the output should look like
output <- data.table(ID = c("A", "A", "A", "B", "B", "B",
                            "A", "C", "C", "D", "D", "D"),
                     X = c(1, 3, 1, 5, 1, 5,
                           6, 10, 2, 3, 3, 4))
output[, .(X = sum(X)), by = ID]
ID X
A 11
B 11
C 12
D 10
output
ID X
A 1
A 3
A 1
B 5
B 1
B 5
A 6 - this observation changed ID from C to A to get group A to 11
C 10
C 2
D 3
D 3
D 4
This is obviously a very simple example, as only group C has enough spare capacity to donate observations. In reality there might be cases where multiple groups are able to give away observations and more than one observation has to be modified; however, there will only ever be up to 4 groups between which movement can happen.
Is it possible to automate this process in R?
Here is a possible heuristic approach. First, subset to those IDs whose group sums are not exactly 10 (say there are N of them). Then distribute the values from these IDs in descending order, always placing the next value into whichever of the N buckets currently has the smallest sum.
b <- 10
input <- data.table(ID = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D"),
                    X = c(1, 3, 1, 5, 1, 5, 6, 10, 2, 3, 3, 4))
#find the IDs that need redistribution
scramID <- input[, sum(X), by = ID][V1!=b, ID]
#extract the values for those IDs
x <- input[ID %in% scramID, sort(X, decreasing=TRUE)]
#create list of empty vectors as buckets
l <- replicate(length(scramID), vector('integer'))
#assign the values starting with largest to the bucket with the smallest sum
for (k in x) {
  i <- which.min(lapply(l, sum))
  l[[i]] <- c(l[[i]], k)
}
#rbind to original dataset to get result
names(l) <- scramID
ans <- rbindlist(list(
  setnames(setDT(stack(l)), c("X","ID")),
  input[!ID %in% scramID]
), use.names=TRUE)
output:
X ID
1: 10 A
2: 1 A
3: 1 A
4: 6 B
5: 3 B
6: 2 B
7: 5 C
8: 5 C
9: 1 C
10: 3 D
11: 3 D
12: 4 D
and the distribution of values:
> ans[, sum(X), by = ID]
ID V1
1: A 12
2: B 11
3: C 11
4: D 10
For this example, the IDs chosen for redistribution are A, B and C. There are three buckets. The values for these IDs in descending order are 10, 6, 5, 5, 3, 2, 1, 1 and 1.
In the first 3 iterations, 10, 6 and 5 are assigned to first, second and third buckets respectively.
The next number is 5, and it is assigned to the third bucket (the one already holding a 5) since that bucket has the smallest sum (5) of the three.
The next number, 3, is assigned to the second bucket since it has the smallest sum (6) of the three buckets.
And so on for the rest of the numbers.
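If an exact answer is wanted rather than a heuristic, the task can also be posed as a small integer program: minimise the number of observations that change ID, subject to every group summing to at least 10. A sketch assuming the lpSolve package (the variable layout and the minimise-moves objective are my own choices, not part of the heuristic above):
library(lpSolve)
ids <- unique(input$ID)
n <- nrow(input)
g <- length(ids)
nv <- n * g  # binary x[i, j]: observation i gets ID j; column index (j - 1) * n + i
# each observation ends up in exactly one group
A1 <- t(sapply(seq_len(n), function(i) {
  v <- numeric(nv); v[(seq_len(g) - 1) * n + i] <- 1; v
}))
# each group's sum of X stays at least 10
A2 <- t(sapply(seq_len(g), function(j) {
  v <- numeric(nv); v[(j - 1) * n + seq_len(n)] <- input$X; v
}))
# cost 1 whenever an observation is assigned away from its original ID
orig <- match(input$ID, ids)
cost <- as.numeric(outer(seq_len(n), seq_len(g), function(i, j) orig[i] != j))
sol <- lp("min", cost, rbind(A1, A2),
          c(rep("=", n), rep(">=", g)),
          c(rep(1, n), rep(10, g)), all.bin = TRUE)
input[, newID := ids[apply(matrix(sol$solution, n, g), 1, which.max)]]
With only a handful of movable groups, as in the question, the model stays small enough for lp() to solve instantly.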
I have the following dataframe:
a a a b c c d e a a b b b e e d d
The required result should be
a b c d e a b e d
It means no two consecutive rows should have the same value. How can it be done without using a loop?
As my data set is quite huge, looping takes a lot of time to execute.
The dataframe structure is like the following
a 1
a 2
a 3
b 2
c 4
c 1
d 3
e 9
a 4
a 8
b 10
b 199
e 2
e 5
d 4
d 10
Result:
a 1
b 2
c 4
d 3
e 9
a 4
b 10
e 2
d 4
It should delete the entire row.
One easy way is to use rle:
Here's your sample data:
x <- scan(what = character(), text = "a a a b c c d e a a b b b e e d d")
# Read 17 items
rle returns a list with two values: the run length ("lengths"), and the value that is repeated for that run ("values").
rle(x)$values
# [1] "a" "b" "c" "d" "e" "a" "b" "e" "d"
Update: For a data.frame
If you are working with a data.frame, try something like the following:
## Sample data
mydf <- data.frame(
V1 = c("a", "a", "a", "b", "c", "c", "d", "e",
"a", "a", "b", "b", "e", "e", "d", "d"),
V2 = c(1, 2, 3, 2, 4, 1, 3, 9,
4, 8, 10, 199, 2, 5, 4, 10)
)
## Use rle, as before
X <- rle(mydf$V1)
## Identify the rows you want to keep
Y <- cumsum(c(1, X$lengths[-length(X$lengths)]))
Y
# [1] 1 4 5 7 8 9 11 13 15
mydf[Y, ]
# V1 V2
# 1 a 1
# 4 b 2
# 5 c 4
# 7 d 3
# 8 e 9
# 9 a 4
# 11 b 10
# 13 e 2
# 15 d 4
Update 2
The "data.table" package has a function rleid that lets you do this quite easily. Using mydf from above, try:
library(data.table)
as.data.table(mydf)[, .SD[1], by = rleid(V1)]
# rleid V2
# 1: 1 1
# 2: 2 2
# 3: 3 4
# 4: 4 3
# 5: 5 9
# 6: 6 4
# 7: 7 10
# 8: 8 2
# 9: 9 4
library(dplyr)
x <- c("a", "a", "a", "b", "c", "c", "d", "e", "a", "a", "b", "b", "b", "e", "e", "d", "d")
x[x != lag(x, default = "1")]
#[1] "a" "b" "c" "d" "e" "a" "b" "e" "d"
(The default needs to be a character value here, so that lag() is type-stable on a character vector.)
EDIT: For data.frame
mydf <- data.frame(
V1 = c("a", "a", "a", "b", "c", "c", "d", "e",
"a", "a", "b", "b", "e", "e", "d", "d"),
V2 = c(1, 2, 3, 2, 4, 1, 3, 9,
4, 8, 10, 199, 2, 5, 4, 10),
stringsAsFactors=FALSE)
The dplyr solution is a one-liner:
mydf %>% filter(V1 != lag(V1, default = "1"))
# V1 V2
#1 a 1
#2 b 2
#3 c 4
#4 d 3
#5 e 9
#6 a 4
#7 b 10
#8 e 2
#9 d 4
post scriptum
lead(x, 1), suggested by @Carl Witthoft, compares each element with the following one, so it keeps the last row of each run instead of the first:
leadit<-function(x) x!=lead(x, default="what")
rows <- leadit(mydf[ ,1])
mydf[rows, ]
# V1 V2
#3 a 3
#4 b 2
#6 c 1
#7 d 3
#8 e 9
#10 a 8
#12 b 199
#14 e 5
#16 d 10
With base R, I like funny algorithmics: compare each element with the next one, using FALSE (coerced to "FALSE") as a sentinel so the last element of the vector is always kept:
x <- c("a", "a", "a", "b", "c", "c", "d", "e", "a", "a", "b", "b", "b", "e", "e", "d", "d")
x[x!=c(x[-1], FALSE)]
#[1] "a" "b" "c" "d" "e" "a" "b" "e" "d"
Much as I like... errr, love rle, here's a shoot-off:
EDIT: Can't figure out exactly what's up with dplyr, so I used dplyr::lead. I'm on OSX, R 3.1.2, and the latest dplyr from CRAN.
library(dplyr)
library(microbenchmark)
xlet <- sample(letters, 1e5, replace = TRUE)
rleit <- function(x) rle(x)$values
lagit <- function(x) x[x != lead(x, default = "1")]
tailit <- function(x) x[x != c(tail(x, -1), tail(x, 1))]
microbenchmark(rleit(xlet), lagit(xlet), tailit(xlet), times = 20)
Unit: milliseconds
expr min lq median uq max neval
rleit(xlet) 27.43996 30.02569 30.20385 30.92817 37.10657 20
lagit(xlet) 12.44794 15.00687 15.14051 15.80254 46.66940 20
tailit(xlet) 12.48968 14.66588 14.78383 15.32276 55.59840 20
Tidyverse solution:
library(dplyr) # consecutive_id() requires dplyr >= 1.1.0
x <- scan(what = character(), text = "a a a b c c d e a a b b b e e d d")
x <- tibble(x)
x |>
  mutate(id = consecutive_id(x)) |>
  distinct(x, id)
In addition, if there is another column y associated with the consecutive-values column, this solution allows some flexibility:
x <- scan(what = character(), text = "a a a b c c d e a a b b b e e d d")
x <- tibble(x, y = runif(length(x)))
x |>
  group_by(id = consecutive_id(x)) |>
  slice_min(y)
We can choose between the different slice functions, like slice_max, slice_min, slice_head, and slice_tail.
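For instance, keeping only the first row of each run, which matches the result asked for originally, can be written in the same style (a small sketch):
x |>
  group_by(id = consecutive_id(x)) |>
  slice_head(n = 1) |>
  ungroup()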
This Stack Overflow thread appeared in the second edition of R4DS, in the Numbers chapter of the book.