Rearrange observations to add up to desired value - data.table? - r

I have a data table with a number of observations; each has a non-unique ID and an integer X value.
The observations are grouped by ID, and I want the sum of X for each group to be at least 10. To achieve this, an observation may change its ID, as long as the sum of X for its original group does not drop below 10.
Below is an example of what I mean and a very manual solution:
# sample data
input <-data.table(ID = c("A", "A", "A", "B", "B", "B" ,
"C", "C", "C", "D", "D", "D"),
X = c(1, 3, 1, 5, 1, 5,
6, 10, 2, 3, 3, 4))
# summarise X by ID
input[, .(X = sum(X)), by = ID]
ID X
A 5
B 11
C 18
D 10
# what the output should look like
output <- data.table(ID = c("A", "A", "A", "B", "B", "B" ,
"A", "C", "C", "D", "D", "D"),
X = c(1, 3, 1, 5, 1, 5,
6, 10, 2, 3, 3, 4))
output[, .(X = sum(X)), by = ID]
ID X
A 11
B 11
C 12
D 10
output
ID X
A 1
A 3
A 1
B 5
B 1
B 5
A 6 - this observation changed ID from C to A to get group A to 11
C 10
C 2
D 3
D 3
D 4
This is obviously a very simple example as only group C has enough spare capacity to donate observations. In reality there might be cases when multiple groups are able to give away observations and when more than one observation has to be modified, however there will only ever be up to 4 groups between which movement can happen.
Is it possible to automate this process in R?

Here is a possible heuristic approach. First, subset to those IDs whose sum is not exactly 10, i.e. less than or greater than 10 (say there are N of these IDs). Then distribute the values from these IDs, in descending order, each into whichever of the N buckets currently has the smallest sum.
library(data.table)
b <- 10
input <- data.table(ID = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D"),
X = c(1, 3, 1, 5, 1, 5, 6, 10, 2, 3, 3, 4))
#find the IDs that need redistribution
scramID <- input[, sum(X), by = ID][V1!=b, ID]
#extract the values for those IDs
x <- input[ID %in% scramID, sort(X, decreasing=TRUE)]
#create list of empty vectors as buckets
l <- replicate(length(scramID), vector('integer'))
#assign the values starting with largest to the bucket with the smallest sum
for (k in x) {
i <- which.min(sapply(l, sum))
l[[i]] <- c(l[[i]], k)
}
#rbind to original dataset to get result
names(l) <- scramID
ans <- rbindlist(list(
setnames(setDT(stack(l)), c("X","ID")),
input[!ID %in% scramID]
), use.names=TRUE)
output:
X ID
1: 10 A
2: 1 A
3: 1 A
4: 6 B
5: 3 B
6: 2 B
7: 5 C
8: 5 C
9: 1 C
10: 3 D
11: 3 D
12: 4 D
and the distribution of values:
> ans[, sum(X), by = ID]
ID V1
1: A 12
2: B 11
3: C 11
4: D 10
For this example, the IDs chosen for redistribution are A, B and C, so there are three buckets. The values for these IDs in descending order are 10, 6, 5, 5, 3, 2, 1, 1 and 1.
In the first 3 iterations, 10, 6 and 5 are assigned to the first, second and third buckets respectively.
The next number, 5, also goes to the third bucket, which already holds a 5, since its sum (5) is the smallest of the three buckets.
The next number, 3, is assigned to the second bucket since it has the smallest sum (6) of the three buckets.
And so on for the rest of the numbers.
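The iteration described above can be traced with a small base R sketch. The helper name `fill_buckets` is hypothetical and not part of the answer; it only reproduces the "largest value into the bucket with the smallest sum" rule, with ties going to the first such bucket as `which.min` does.

```r
# Hypothetical helper: greedy assignment of values to named buckets.
fill_buckets <- function(values, bucket_names) {
  # start every bucket as an empty numeric vector (its sum is 0)
  buckets <- setNames(rep(list(numeric(0)), length(bucket_names)), bucket_names)
  for (v in sort(values, decreasing = TRUE)) {
    sums <- vapply(buckets, sum, numeric(1))
    i <- which.min(sums)               # first bucket with the smallest sum
    buckets[[i]] <- c(buckets[[i]], v)
  }
  buckets
}

# values of the IDs A, B and C from the example
res <- fill_buckets(c(1, 3, 1, 5, 1, 5, 6, 10, 2), c("A", "B", "C"))
bucket_sums <- vapply(res, sum, numeric(1))
bucket_sums
```

Because this is a heuristic, it is worth checking the result afterwards, for example with `all(bucket_sums >= 10)`.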

Related

Conditional rolling sum based on another column

I would like to compute the conditional rolling sum of a column, but based on the values of another column.
I have a table like this:
data_frame <- data.frame( category1 = c("A", "A", "A", "B", "B", "B", "A", "A", "B"),
category2 = c("B", "B", "B", "A", "A", "A", "B", "B", "A"),
value = c(1, 2, 1, 2, 1, 5, 3, 4, 2),
desired_output = c(0, 0, 0, 4, 4, 4, 8, 8, 11))
data_frame2 <- data_frame %>%
group_by(category1) %>%
mutate(cumsum = cumsum(value))
category1 category2 value cumsum desired_output
A B 1 1 0
A B 2 3 0
A B 1 4 0
B A 2 2 4
B A 1 3 4
B A 5 8 4
A B 3 7 8
A B 4 11 8
B A 2 10 11
I am able to compute the rolling sum of the value based on category1 or category2 using cumsum, but I would like a column which calculates a rolling sum of the value column when category1 equals the current value of category2. For example, in the last row of the above example it sums the value of all the above rows when category1 == A, as the current value of category2 is A.
I have tried various hacky ifelse/lag/fill solutions, but nothing gets close to what I need. I have also tried adding a condition to the ave function, as below, but I am not sure what the syntax should be:
data_frame2$desired_output <- ave(data_frame2$value, data_frame2$category1 = data_frame2$category2, FUN=cumsum)
Thanks in advance - first question so apologies about anything I missed/got wrong!
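One way to read the requirement is: for each row, sum `value` over all earlier rows whose `category1` equals the current row's `category2`. A minimal base R sketch under that reading follows; the name `rolling` is made up for illustration, and the row-by-row loop is quadratic, so it is meant to show the logic rather than to be fast.

```r
data_frame <- data.frame(
  category1 = c("A", "A", "A", "B", "B", "B", "A", "A", "B"),
  category2 = c("B", "B", "B", "A", "A", "A", "B", "B", "A"),
  value = c(1, 2, 1, 2, 1, 5, 3, 4, 2),
  stringsAsFactors = FALSE
)

# For row i, look only at rows 1..(i - 1) and keep those whose
# category1 matches the category2 of row i.
rolling <- vapply(seq_len(nrow(data_frame)), function(i) {
  earlier <- seq_len(i - 1)
  keep <- data_frame$category1[earlier] == data_frame$category2[i]
  sum(data_frame$value[earlier][keep])
}, numeric(1))

data_frame$rolling <- rolling
data_frame
```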

1:1 Matching with multiple matches between treatment and control groups

Hi, I'm currently using a large observational dataset to estimate the average effect of a treatment. To balance the treatment and control groups, I matched individuals on a series of variables using the full_join command.
matched_sample <- full_join(case, control, by = matched_varaibles)
The matched sample ended up with many rows because some individuals were matched more than once. I documented the number of matches found for each individual. Here I present a simpler version:
case_id <- c("A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C", "C", "D", "D", "E", "F", "F")
num_controls_matched <- c(7, 7, 7, 7, 7, 7, 7, 3, 3, 3, 5, 5, 5, 5, 5, 2, 2, 1, 2, 2)
control_id <- c("a" , "b", "c", "d", "e", "f", "g", "a", "b", "e", "a", "b", "e", "f", "h", "a", "e", "a", "b", "e")
num_cases_matched <- c(5, 4, 1, 1, 5, 2, 1, 5, 4, 5, 5, 4, 5, 2, 1, 5, 5, 5, 4, 5)
case_id num_controls_matched control_id num_cases_matched
1 A 7 a 5
2 A 7 b 4
3 A 7 c 1
4 A 7 d 1
5 A 7 e 5
6 A 7 f 2
7 A 7 g 1
8 B 3 a 5
9 B 3 b 4
10 B 3 e 5
11 C 5 a 5
12 C 5 b 4
13 C 5 e 5
14 C 5 f 2
15 C 5 h 1
16 D 2 a 5
17 D 2 e 5
18 E 1 a 5
19 F 2 b 4
20 F 2 e 5
where case_id and control_id are IDs of those from the treatment and the control groups, num_controls_matched is the number of matches found for the treated individuals, and num_cases_matched is the number of matches found for individuals in the control group.
I would like to keep as many treated individuals in the sample as possible. I would also like to prioritise the matches for the "less popular" individuals. For example, the treated individual E was only matched to 1 control, so the match E-a should be prioritised. Then, both D and F have 2 matches. Because b has only 4 matches whilst a and e both have 5 matches, F-b should be prioritised. Therefore, D can only be matched with e. The next one should be B because it has 3 matches. However, since a, b and e have already been matched with D, E and F, B has no match (NA). C is matched with h because h has only 1 match. A can be matched with c, d, or g.
I would like to construct a data frame indicating the final 1:1 matches:
case_id control_id
A g
B NA
C h
D e
E a
F b
The original dataset includes more than 2,000 individuals, and some individuals have more than 30 matches. Due to the characteristics of some matching variables, propensity score matching is not what I am looking for. I would be really grateful for your help with this.
fun <- function(df, i = 1){
a <- df %>%
filter(num_controls_matched == i | num_cases_matched == i)
b <- df %>%
filter(!(case_id %in% a$case_id | control_id %in% a$control_id))
if (any(table(b$case_id) > 1)) fun(df, i + 1)
else rbind(a, b)[c('case_id', 'control_id')]
}
fun(df)
case_id control_id
1 A a
2 B b
3 C c
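The prioritisation rule described in the question can also be sketched as a single greedy pass in base R. This is an illustration under simplifying assumptions, not a guaranteed optimal matching: cases are processed from fewest to most candidate controls (ties broken alphabetically), and each case takes the still-unused control with the fewest candidate cases. The names `matches`, `case_order`, `used` and `result` are made up for the sketch.

```r
matches <- data.frame(
  case_id = c("A", "A", "A", "A", "A", "A", "A", "B", "B", "B",
              "C", "C", "C", "C", "C", "D", "D", "E", "F", "F"),
  control_id = c("a", "b", "c", "d", "e", "f", "g", "a", "b", "e",
                 "a", "b", "e", "f", "h", "a", "e", "a", "b", "e"),
  num_controls_matched = c(7, 7, 7, 7, 7, 7, 7, 3, 3, 3,
                           5, 5, 5, 5, 5, 2, 2, 1, 2, 2),
  num_cases_matched = c(5, 4, 1, 1, 5, 2, 1, 5, 4, 5,
                        5, 4, 5, 2, 1, 5, 5, 5, 4, 5),
  stringsAsFactors = FALSE
)

# least "popular" cases first
case_order <- unique(matches$case_id[order(matches$num_controls_matched,
                                           matches$case_id)])
used <- character(0)
result <- data.frame(case_id = case_order, control_id = NA_character_,
                     stringsAsFactors = FALSE)

for (i in seq_along(case_order)) {
  # candidate controls for this case that nobody has taken yet
  cand <- matches[matches$case_id == case_order[i] &
                    !(matches$control_id %in% used), ]
  if (nrow(cand) > 0) {
    pick <- cand$control_id[which.min(cand$num_cases_matched)]
    result$control_id[i] <- pick
    used <- c(used, pick)
  }
}

result <- result[order(result$case_id), ]
result
```

On the sample data this leaves B unmatched and pairs A with c, which the question lists as an acceptable alternative to g.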

confusion between categories in dplyr

I have the following data frame, describing conditions each patient has (each can have more than 1):
df <- structure(list(patient = c(1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6,
6, 7, 7, 8, 8, 9, 9, 10), condition = c("A", "A", "B", "B", "D",
"C", "A", "C", "C", "B", "D", "B", "A", "A", "C", "B", "C", "D",
"C", "D")), row.names = c(NA, -20L), class = c("tbl_df", "tbl",
"data.frame"))
I would like to create a "confusion matrix", which in this case will be a 4x4 matrix where AxA will have the value 5 (5 patients have condition A), AxB will have the value 2 (two patients have A and B), and so on.
How can I achieve this?
You can join the table to itself on patient and then tabulate the resulting pairs of conditions.
library(dplyr)
df2 <- inner_join(df,df, by = "patient")
table(df2$condition.x,df2$condition.y)
A B C D
A 5 2 2 1
B 2 5 3 2
C 2 3 6 2
D 1 2 2 4
Here is a base R answer using outer -
count_patient <- function(x, y) {
length(intersect(df$patient[df$condition == x],
df$patient[df$condition == y]))
}
vec <- sort(unique(df$condition))
res <- outer(vec, vec, Vectorize(count_patient))
dimnames(res) <- list(vec, vec)
res
# A B C D
#A 5 2 2 1
#B 2 5 3 2
#C 2 3 6 2
#D 1 2 2 4
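Another base R sketch, not taken from the answers above, builds a patient-by-condition incidence table and multiplies it by its own transpose with `crossprod`. It assumes that each patient/condition pair appears at most once, as in the sample data; the names `incidence` and `co_occurrence` are made up for illustration.

```r
df <- data.frame(
  patient = c(1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6, 6, 7, 7, 8, 8, 9, 9, 10),
  condition = c("A", "A", "B", "B", "D", "C", "A", "C", "C", "B",
                "D", "B", "A", "A", "C", "B", "C", "D", "C", "D"),
  stringsAsFactors = FALSE
)

# rows = patients, columns = conditions, entries = 0/1
incidence <- table(df$patient, df$condition)

# t(incidence) %*% incidence: entry [x, y] counts patients having both x and y
co_occurrence <- crossprod(incidence)
co_occurrence
```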

Logic for filtering dependent on two columns [duplicate]

This question already has answers here:
Select groups which have at least one of a certain value
(3 answers)
Closed 2 years ago.
I am struggling to write the right logic to filter two columns based only on the condition in one column. I have multiple ids and if an id appears in 2020, I want all the data for the other years that id was measured to come along.
As an example, if a group contains the number 3, I want all the values in that group. We should end up with a dataframe with all the b and d rows.
df4 <- data.frame(group = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b",
"c", "c", "c", "c", "c", "d", "d", "d", "d", "d"),
pop = c(1, 2, 2, 4, 5, 1, 2, 3, 4, 5, 1, 2, 1, 4, 5, 1, 2, 3, 4, 5),
value = c(1,2,3,2.5,2,2,3,4,3.5,3,3,2,1,2,2.5,0.5,1.5,6,2,1.5))
threes <- df4 %>%
filter(pop == 3 |&ifelse????
A bit slower than the other answers here (more steps involved), but for me a bit clearer:
df4 %>%
filter(pop == 3) %>%
distinct(group) %>%
pull(group) -> groups
df4 %>%
filter(group %in% groups)
or if you want to combine the two steps:
df4 %>%
filter(group %in% (df4 %>%
filter(pop == 3) %>%
distinct(group) %>%
pull(group)))
You can do:
df4[df4$group %in% df4$group[df4$pop == 3],]
#> group pop value
#> 6 b 1 2.0
#> 7 b 2 3.0
#> 8 b 3 4.0
#> 9 b 4 3.5
#> 10 b 5 3.0
#> 16 d 1 0.5
#> 17 d 2 1.5
#> 18 d 3 6.0
#> 19 d 4 2.0
#> 20 d 5 1.5
You can do this with dplyr by combining group_by(), filter() and any(). any() returns TRUE if at least one value in the group matches the condition, and group_by() makes filter() evaluate it separately for each group.
Follow these steps:
First, pipe the data to group_by() to group by your group variable.
Then pipe to filter(), using any() to keep only the groups in which some pop equals 3.
df4 <- data.frame(group = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b",
"c", "c", "c", "c", "c", "d", "d", "d", "d", "d"),
pop = c(1, 2, 2, 4, 5, 1, 2, 3, 4, 5, 1, 2, 1, 4, 5, 1, 2, 3, 4, 5),
value = c(1,2,3,2.5,2,2,3,4,3.5,3,3,2,1,2,2.5,0.5,1.5,6,2,1.5))
# load the library
library(dplyr)
threes <- df4 %>%
group_by(group) %>%
filter(any(pop == 3))
# print the result
threes
Output:
threes
# A tibble: 10 x 3
# Groups: group [2]
group pop value
<chr> <dbl> <dbl>
1 b 1 2
2 b 2 3
3 b 3 4
4 b 4 3.5
5 b 5 3
6 d 1 0.5
7 d 2 1.5
8 d 3 6
9 d 4 2
10 d 5 1.5
An easy base R option is using subset + ave
subset(
df4,
ave(pop == 3, group, FUN = any)
)
which gives
group pop value
6 b 1 2.0
7 b 2 3.0
8 b 3 4.0
9 b 4 3.5
10 b 5 3.0
16 d 1 0.5
17 d 2 1.5
18 d 3 6.0
19 d 4 2.0
20 d 5 1.5
Use dplyr:
df4%>%group_by(group)%>%filter(any(pop==3))

Create Weighted Graph from Group Membership Data

Suppose I have a dataset in R indicating the individuals within groups. Here is an example:
grp <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5)
ind <- c("A", "C", "D", "B", "C", "D", "E", "A", "D", "E", "B", "F", "E", "A", "F")
data.frame(grp, ind)
So, the data look like this:
grp ind
1 1 A
2 1 C
3 1 D
4 2 B
5 2 C
6 2 D
7 2 E
8 3 A
9 3 D
10 3 E
11 4 B
12 4 F
13 4 E
14 5 A
15 5 F
So, group 1 is composed of individuals (A, C, D), group 2 is composed of individuals (B, C, D, E), and so on. I would like to create a network graph that shows how individuals are connected with each other. Within a group, all individuals are connected by edges. The thickness of the edges should reflect how often two individuals are connected to each other.
With:
pairs <- do.call(rbind, sapply(split(ind, grp), function(x) t(combn(x,2))))
I can obtain a matrix with all pairwise edges, which I can plot with the igraph package:
library(igraph)
plot(graph.edgelist(pairs, directed=FALSE), edge.curved=FALSE)
But is there a way of making the thickness of the edges proportional to how often a particular pairing occurred?
@hrbrmstr's solution builds a second graph to get the edge weights. You could also do this beforehand by operating on pairs:
# Count unique edge pairs
library(plyr)
weighted <- ddply(data.frame(pairs), .(X1, X2), count)
# Plot
library(igraph)
g <- graph.edgelist(as.matrix(weighted[,1:2]), directed=FALSE)
plot(g, edge.curved=FALSE, edge.width=weighted$freq*3)
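The same pre-counting idea can be sketched in base R without plyr. Each group's members are sorted before pairing so that, for example, E-F and F-E count as the same edge; the name `edges` and its column names are made up for illustration.

```r
grp <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5)
ind <- c("A", "C", "D", "B", "C", "D", "E", "A", "D", "E", "B", "F", "E", "A", "F")

# all within-group pairs, with the two endpoints in alphabetical order
pairs <- do.call(rbind, lapply(split(ind, grp),
                               function(x) t(combn(sort(x), 2))))

# one row per distinct pair, weight = number of groups the pair shares
edges <- aggregate(weight ~ from + to,
                   data = data.frame(from = pairs[, 1], to = pairs[, 2],
                                     weight = 1),
                   FUN = sum)
edges
```

The resulting data frame could then be handed to igraph, for instance via graph_from_data_frame(edges, directed = FALSE), with the weight column used to scale edge.width.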
Could probably tighten this up a bit, but…
library(igraph)
grp <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5)
ind <- c("A", "C", "D", "B", "C", "D", "E", "A", "D", "E", "B", "F", "E", "A", "F")
pairs <- do.call(rbind, sapply(split(ind, grp), function(x) t(combn(x,2))))
g <- graph.edgelist(pairs, directed=FALSE)
m <- get.adjacency(g) # get the adjacency matrix
net <- graph.adjacency(m,
mode="undirected",
weighted=TRUE,
diag=FALSE)
print(E(net)$weight) # just for kicks
## [1] 1 2 1 1 2 1 1 1 2 2 1 1
set.seed(1492) # ensures consistent layout every run
plot.igraph(net,
vertex.label=V(net)$name,
layout=layout.fruchterman.reingold,
edge.color="black",
edge.width=E(net)$weight*3)
library(igraph)
grp <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5)
ind <- c("A", "C", "D", "B", "C", "D", "E", "A", "D", "E", "B", "F", "E", "A", "F")
data.frame(grp, ind)
pairs <- do.call(rbind, sapply(split(ind, grp), function(x) t(combn(x,2))))
g = graph.data.frame(pairs, directed=FALSE)
E(g)$weight = c(1, grp)
plot.igraph(g, edge.width=E(g)$weight)
I am not sure if I am putting the weight correctly but I hope you find my code helpful.
You can find more helpful sample code on weighting edges and on weighting nodes.
Here is a solution that creates the bipartite graph, and then projects it to the individuals.
g <- graph.edgelist(cbind(grp, ind), directed=FALSE)
V(g)$type <- V(g)$name %in% grp
ind_g <- bipartite.projection(g)[[1]]
E(ind_g)$width <- E(ind_g)$weight * 3
plot(ind_g)
