I have the following data frame, describing conditions each patient has (each can have more than 1):
df <- structure(list(patient = c(1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6,
6, 7, 7, 8, 8, 9, 9, 10), condition = c("A", "A", "B", "B", "D",
"C", "A", "C", "C", "B", "D", "B", "A", "A", "C", "B", "C", "D",
"C", "D")), row.names = c(NA, -20L), class = c("tbl_df", "tbl",
"data.frame"))
I would like to create a "confusion matrix", which in this case will be a 4x4 matrix where AxA will have the value 5 (5 patients have condition A), AxB will have the value 2 (two patients have A and B), and so on.
How can I achieve this?
You can join the table to itself and then tabulate the resulting pairs of conditions.
library(dplyr)
df2 <- inner_join(df, df, by = "patient")
table(df2$condition.x, df2$condition.y)
    A B C D
  A 5 2 2 1
  B 2 5 3 2
  C 2 3 6 2
  D 1 2 2 4
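As a side note (my addition, assuming each patient carries each condition at most once, as in this data), the same matrix falls out of one line by cross-multiplying the patient-by-condition incidence table:
crossprod(table(df$patient, df$condition))
The diagonal holds the per-condition patient counts and the off-diagonal cells hold the pairwise overlaps.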
Here is a base R answer using outer -
count_patient <- function(x, y) {
  # number of patients having both condition x and condition y
  length(intersect(df$patient[df$condition == x],
                   df$patient[df$condition == y]))
}
vec <- sort(unique(df$condition))
res <- outer(vec, vec, Vectorize(count_patient))
dimnames(res) <- list(vec, vec)
res
# A B C D
#A 5 2 2 1
#B 2 5 3 2
#C 2 3 6 2
#D 1 2 2 4
I am trying to "clean" a dataset from which many "empty" rows were deleted; I want these empty rows back (filled with NA). Here is a toy dataset:
values <- rnorm(12)
data <- data.frame(ID = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5),
                   event = c("A", "B", "C", "A", "B", "A", "B", "C", "B", "A", "B", "C"),
                   value = values) # values are random
What I want is to insert the rows that are missing, i.e. ID 2 is missing event C, and ID 4 is missing A and C. The expected result is as follows:
data_expanded <- data.frame(ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5),
                            event = c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B", "C"),
                            value = c(values[1:5], NA, values[6:8], NA, values[9], NA, values[10:12]))
The rows with NA can be added at the end of the data frame (they do not have to be grouped as in the example I provided). My real dataset has many rows, so a memory-efficient method is highly appreciated. I would prefer a method using R with tidyr (or the tidyverse).
tidyr::complete() does exactly what you want:
library(tidyr)
values <- rnorm(12)
data <- data.frame(ID = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5),
                   event = c("A", "B", "C", "A", "B", "A", "B", "C", "B", "A", "B", "C"),
                   value = values) # values are random
data |>
  complete(ID, event)
#> # A tibble: 15 × 3
#> ID event value
#> <dbl> <chr> <dbl>
#> 1 1 A 0.397
#> 2 1 B -0.595
#> 3 1 C 0.743
#> 4 2 A -0.0421
#> 5 2 B 1.47
#> 6 2 C NA
#> 7 3 A 0.218
#> 8 3 B -0.525
#> 9 3 C 1.05
#> 10 4 A NA
#> 11 4 B -1.79
#> 12 4 C NA
#> 13 5 A 1.18
#> 14 5 B -1.39
#> 15 5 C 0.748
Created on 2022-12-12 with reprex v2.0.2
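For comparison, a base R sketch of the same idea (my addition, not from the reprex above): cross the unique IDs and events with expand.grid, then left-join the original data so that missing pairs get NA.
full <- expand.grid(ID = unique(data$ID), event = unique(data$event))
data_expanded <- merge(full, data, all.x = TRUE) # unmatched rows get value = NA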
Hi, I'm currently using a large observational dataset to estimate the average effect of a treatment. To balance the treatment and control groups, I matched individuals on a series of variables using the full_join command.
matched_sample <- full_join(case, control, by = matched_variables)
The matched sample ended up with many rows because some individuals were matched more than once. I documented the number of matches found for each individual. Here I present a simpler version:
case_id <- c("A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C", "C", "D", "D", "E", "F", "F")
num_controls_matched <- c(7, 7, 7, 7, 7, 7, 7, 3, 3, 3, 5, 5, 5, 5, 5, 2, 2, 1, 2, 2)
control_id <- c("a" , "b", "c", "d", "e", "f", "g", "a", "b", "e", "a", "b", "e", "f", "h", "a", "e", "a", "b", "e")
num_cases_matched <- c(5, 4, 1, 1, 5, 2, 1, 5, 4, 5, 5, 4, 5, 2, 1, 5, 5, 5, 4, 5)
case_id num_controls_matched control_id num_cases_matched
1 A 7 a 5
2 A 7 b 4
3 A 7 c 1
4 A 7 d 1
5 A 7 e 5
6 A 7 f 2
7 A 7 g 1
8 B 3 a 5
9 B 3 b 4
10 B 3 e 5
11 C 5 a 5
12 C 5 b 4
13 C 5 e 5
14 C 5 f 2
15 C 5 h 1
16 D 2 a 5
17 D 2 e 5
18 E 1 a 5
19 F 2 b 4
20 F 2 e 5
where case_id and control_id are IDs of those from the treatment and the control groups, num_controls_matched is the number of matches found for the treated individuals, and num_cases_matched is the number of matches found for individuals in the control group.
I would like to keep as many treated individuals in the sample as possible. I would also like to prioritise the matches for the "less popular" individuals. For example, the treated individual E was only matched to 1 control, so the match E-a should be prioritised. Then, both D and F have 2 matches. Because b has only 4 matches whilst a and e both have 5 matches, F-b should be prioritised. Therefore, D can only be matched with e. The next one should be B because it has 3 matches. However, since a, b and e have already been matched with D, E and F, B has no match (NA). C is matched with h because h has only 1 match. A can be matched with c, d, or g.
I would like to construct a data frame indicating the final 1:1 matches:
case_id control_id
A g
B NA
C h
D e
E a
F b
The original dataset includes more than 2,000 individuals, and some individuals have more than 30 matches. Due to the characteristics of some matching variables, propensity score matching is not what I am looking for. I would be really grateful for your help on this.
library(dplyr)

# assemble the question's vectors into a data frame
df <- data.frame(case_id, num_controls_matched, control_id, num_cases_matched)

fun <- function(df, i = 1) {
  # pairs involving an individual with exactly i candidate matches get priority
  a <- df %>%
    filter(num_controls_matched == i | num_cases_matched == i)
  # keep only pairs whose case and control are both still unmatched
  b <- df %>%
    filter(!(case_id %in% a$case_id | control_id %in% a$control_id))
  # if some case still has several candidate controls, raise the threshold
  if (any(table(b$case_id) > 1)) fun(df, i + 1)
  else rbind(a, b)[c('case_id', 'control_id')]
}
fun(df)
  case_id control_id
1       A          f
2       C          f
3       D          a
4       D          e
5       F          b
6       F          e
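If you also want cases that end up with no admissible control to appear with NA (like B in the expected output), a small follow-up sketch (my addition, not part of the answer above) pads the result:
res <- fun(df)
unmatched <- setdiff(unique(df$case_id), res$case_id)
rbind(res, data.frame(case_id = unmatched, control_id = NA))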
I am working with the R programming language. Suppose I have the following data frame:
a = rnorm(100,10,1)
b = rnorm(100,10,5)
c = rnorm(100,10,10)
my_data_2 = data.frame(a,b,c)
my_data_2$group = as.factor(C)
My Question: Suppose I want to add an ID column to this data frame that labels the first observation "100" and increases the ID by 1 for each new row. I tried to do this as follows:
my_data_2$id = seq(101, 200, by = 1)
However, this "corrupted" the data frame:
head(my_data_2)
a b c
1 10.381397 9.534634 12.8330946
2 10.326785 6.397006 8.1217063
3 8.333354 11.474064 11.6035562
4 9.583789 12.096404 18.2764387
5 9.581740 12.302016 4.0601871
6 11.772943 9.151642 -0.3686874
group
1 c(9.98552413605153, 9.53807731118048, 6.92589246998173, 8.97095368638206, 9.70249918748529, 10.6161773148626, 9.2514231659343, 10.6566757899233, 10.2351848084123, 9.45970725813352, 9.15347719257448, 9.30428244749624, 8.43075784609759, 11.1200169905262, 11.3493313166827, 8.86895968334901, 9.13208319045466, 9.70062759133717)
2 c(8.90358954387628, 13.8756093430144, 12.9970566311467, 10.4227745183785, 21.3259516051226, 4.88590162247496, 10.260282181, 14.092109840631, 7.37839577680487, 9.09764173775965, 15.1636139760987, 9.9773055885761, 8.29361737323061, 8.61361852648607, 12.6807897406641, 0.00863359720839085, 10.7660528147358, 9.79616528370632)
3 c(25.8063583646201, -11.5722310383483, 8.56096791164312, 12.2858029391835, -0.312392781809937, 0.946343715084028, 2.45881422753051, 7.26197515743391, 0.333766891336273, 14.9149659649045, -4.55483090530928, -19.8075232688082, 16.59106194569, 18.7377329188129, 1.1771203751127, -6.19019973790205, -5.02277721344565, 23.3363430334739)
4 c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3)
5 c("B", "B", "B", "A", "B", "B", "B", "B", "B", "B", "B", "A", "B", "B", "B", "B", "B", "B")
6 c("B", "B", "B", "B", "B", "A", "B", "B", "A", "B", "B", "B", "B", "B", "B", "B", "B", "B")
id
1 101
2 102
3 103
4 104
5 105
6 106
Can someone please show me how to fix this problem?
Thanks!
Your problem isn't your ID column; your problem is where you define your group variable. You call as.factor(C) (note the uppercase C), but the column of your data frame is the lowercase c. So I guess you have defined another object C outside of your data frame, which now "corrupts" your data frame.
You probably want to do:
my_data_2$group <- as.factor(my_data_2$c)
I was able to figure out the answer!
a = rnorm(100,10,1)
b = rnorm(100,10,5)
c = rnorm(100,10,10)
my_data_2 = data.frame(a,b,c)
my_data_2$group = as.factor("C")
my_data_2$id = seq(101, 200, by = 1)
head(my_data_2)
a b c group id
1 9.436773 10.712568 3.7699748 C 101
2 10.265810 3.408589 11.9230024 C 102
3 10.503245 12.197000 8.3620889 C 103
4 9.279878 7.007812 16.8268852 C 104
5 10.683518 8.039032 5.2287997 C 105
6 11.097258 10.313103 0.4988398 C 106
This question already has answers here:
Select groups which have at least one of a certain value
(3 answers)
Closed 2 years ago.
I am struggling to write the right logic to filter two columns based only on the condition in one column. I have multiple ids and if an id appears in 2020, I want all the data for the other years that id was measured to come along.
As an example, if a group contains the number 3, I want all the values in that group. We should end up with a dataframe with all the b and d rows.
df4 <- data.frame(group = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b",
                            "c", "c", "c", "c", "c", "d", "d", "d", "d", "d"),
                  pop = c(1, 2, 2, 4, 5, 1, 2, 3, 4, 5, 1, 2, 1, 4, 5, 1, 2, 3, 4, 5),
                  value = c(1, 2, 3, 2.5, 2, 2, 3, 4, 3.5, 3, 3, 2, 1, 2, 2.5, 0.5, 1.5, 6, 2, 1.5))
threes <- df4 %>%
filter(pop == 3 |&ifelse????
A bit slower than the other answers here (more steps involved), but for me a bit clearer:
df4 %>%
  filter(pop == 3) %>%
  distinct(group) %>%
  pull(group) -> groups

df4 %>%
  filter(group %in% groups)
or if you want to combine the two steps:
df4 %>%
  filter(group %in% (df4 %>%
                       filter(pop == 3) %>%
                       distinct(group) %>%
                       pull(group)))
The parentheses around the inner pipeline are required: %in% and %>% share the same operator precedence, so without them R would parse group %in% df4 first.
You can do:
df4[df4$group %in% df4$group[df4$pop == 3],]
#> group pop value
#> 6 b 1 2.0
#> 7 b 2 3.0
#> 8 b 3 4.0
#> 9 b 4 3.5
#> 10 b 5 3.0
#> 16 d 1 0.5
#> 17 d 2 1.5
#> 18 d 3 6.0
#> 19 d 4 2.0
#> 20 d 5 1.5
You can do it this way using dplyr's group_by(), filter() and any() functions combined. any() returns TRUE if the condition holds for at least one row, and group_by() makes filter() apply that test within each subgroup of the grouping variable.
Follow these steps:
First pipe the data to group_by() to group by your group variable.
Then pipe to filter(), keeping a group whenever any of its pop values equals 3, via any().
df4 <- data.frame(group = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b",
                            "c", "c", "c", "c", "c", "d", "d", "d", "d", "d"),
                  pop = c(1, 2, 2, 4, 5, 1, 2, 3, 4, 5, 1, 2, 1, 4, 5, 1, 2, 3, 4, 5),
                  value = c(1, 2, 3, 2.5, 2, 2, 3, 4, 3.5, 3, 3, 2, 1, 2, 2.5, 0.5, 1.5, 6, 2, 1.5))
# load the library
library(dplyr)
threes <- df4 %>%
  group_by(group) %>%
  filter(any(pop == 3))
# print the result
threes
Output:
threes
# A tibble: 10 x 3
# Groups: group [2]
group pop value
<chr> <dbl> <dbl>
1 b 1 2
2 b 2 3
3 b 3 4
4 b 4 3.5
5 b 5 3
6 d 1 0.5
7 d 2 1.5
8 d 3 6
9 d 4 2
10 d 5 1.5
An easy base R option is using subset + ave
subset(
  df4,
  ave(pop == 3, group, FUN = any)
)
which gives
group pop value
6 b 1 2.0
7 b 2 3.0
8 b 3 4.0
9 b 4 3.5
10 b 5 3.0
16 d 1 0.5
17 d 2 1.5
18 d 3 6.0
19 d 4 2.0
20 d 5 1.5
Use dplyr:
df4 %>% group_by(group) %>% filter(any(pop == 3))
This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 6 years ago.
I have data in long format: one row per observation, with a letter column identifying the group and a number column holding the observed value.
How can I reorganize the data with R into wide format?
In other words: create a new column for every distinct observed number and record a simple count (1) if that number occurs for the specific group.
This is most easily done using the tidyr package:
library(tidyr)
dat <- data.frame(letter = c("A", "A", "A", "A",
                             "B", "B", "B", "C",
                             "C", "C", "C", "D"),
                  number = c(2, 3, 4, 5, 4, 5, 6, 1, 3, 5, 7, 1),
                  value = 1)
spread(dat, number, value)
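spread() has since been superseded in tidyr; a pivot_wider() equivalent (a sketch assuming tidyr >= 1.0.0) can also fill the missing cells with 0 instead of NA:
pivot_wider(dat, names_from = number, values_from = value, values_fill = list(value = 0))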
I would like to provide a base R solution (maybe just for fun...), based on matrix indexing. The value column is not needed here:
dat <- data.frame(letter = c("A", "A", "A", "A",
                             "B", "B", "B", "C",
                             "C", "C", "C", "D"),
                  number = c(2, 3, 4, 5, 4, 5, 6, 1, 3, 5, 7, 1))
lev <- unique(dat[[1L]]); k <- length(lev) ## unique levels
x <- dat[[2L]]; p <- max(x) ## column position
z <- matrix(0L, nrow = k, ncol = p, dimnames = list(lev, seq_len(p))) ## initialization
z[cbind(match(dat[[1L]], lev), dat[[2L]])] <- 1L ## replacement
z ## display
# 1 2 3 4 5 6 7
#A 0 1 1 1 1 0 0
#B 0 0 0 1 1 1 0
#C 1 0 1 0 1 0 1
#D 1 0 0 0 0 0 0
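Since each (letter, number) pair occurs at most once here, the same 0/1 matrix also falls out of a plain cross-tabulation (my addition; with repeated pairs it would count occurrences instead):
table(dat$letter, dat$number)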