Suppose I have a dataset in R indicating the individuals within groups. Here is an example:
grp <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5)
ind <- c("A", "C", "D", "B", "C", "D", "E", "A", "D", "E", "B", "F", "E", "A", "F")
data.frame(grp, ind)
So, the data look like this:
grp ind
1 1 A
2 1 C
3 1 D
4 2 B
5 2 C
6 2 D
7 2 E
8 3 A
9 3 D
10 3 E
11 4 B
12 4 F
13 4 E
14 5 A
15 5 F
So, group 1 is composed of individuals (A, C, D), group 2 is composed of individuals (B, C, D, E), and so on. I would like to create a network graph that shows how individuals are connected with each other. Within a group, all individuals are connected by edges. The thickness of the edges should reflect how often two individuals are connected to each other.
With:
pairs <- do.call(rbind, sapply(split(ind, grp), function(x) t(combn(x,2))))
I can obtain a matrix with all pairwise edges, which I can plot with the igraph package:
library(igraph)
plot(graph.edgelist(pairs, directed=FALSE), edge.curved=FALSE)
But is there a way of making the thickness of the edges proportional to how often a particular pairing occurred?
@hrbrmstr's solution builds a second graph to get the edge weights. You could also do this beforehand by operating on pairs:
# Count unique edge pairs
library(plyr)
weighted <- ddply(data.frame(pairs), .(X1, X2), count)
# Plot
library(igraph)
g <- graph.edgelist(as.matrix(weighted[,1:2]), directed=FALSE)
plot(g, edge.curved=FALSE, edge.width=weighted$freq*3)
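If you prefer dplyr (plyr's successor), the same tally can be written with count() — a minimal sketch, assuming a reasonably recent dplyr is installed:
library(dplyr)
# count() collapses duplicate rows of the edge list and names the tally column
weighted <- data.frame(pairs) %>% count(X1, X2, name = "freq")
The downstream plotting code is unchanged, since count() returns the same X1, X2, freq columns and graph.edgelist() keeps the edges in row order.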
Could probably tighten this up a bit, but…
library(igraph)
grp <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5)
ind <- c("A", "C", "D", "B", "C", "D", "E", "A", "D", "E", "B", "F", "E", "A", "F")
pairs <- do.call(rbind, sapply(split(ind, grp), function(x) t(combn(x,2))))
g <- graph.edgelist(pairs, directed=FALSE)
m <- get.adjacency(g) # get the adjacency matrix
net <- graph.adjacency(m,
                       mode="undirected",
                       weighted=TRUE,
                       diag=FALSE)
print(E(net)$weight) # just for kicks
## [1] 1 2 1 1 2 1 1 1 2 2 1 1
set.seed(1492) # ensures consistent layout every run
plot.igraph(net,
            vertex.label=V(net)$name,
            layout=layout.fruchterman.reingold,
            edge.color="black",
            edge.width=E(net)$weight*3)
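As an aside, an equivalent shortcut sketch (my addition, assuming the g built above): give every parallel edge a weight of 1 and let simplify() sum them while collapsing duplicates, instead of round-tripping through the adjacency matrix (net2 is just a scratch name):
# collapse parallel edges, summing their unit weights into one weighted edge
E(g)$weight <- 1
net2 <- simplify(g, edge.attr.comb = list(weight = "sum"))
E(net2)$weight then holds the same pair counts as E(net)$weight above, possibly in a different edge order.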
library(igraph)
grp <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5)
ind <- c("A", "C", "D", "B", "C", "D", "E", "A", "D", "E", "B", "F", "E", "A", "F")
data.frame(grp, ind)
pairs <- do.call(rbind, sapply(split(ind, grp), function(x) t(combn(x,2))))
g = graph.data.frame(as.data.frame(pairs), directed=FALSE)
E(g)$weight = c(1, grp)  # ad-hoc weights; see the sketch below for data-derived ones
plot.igraph(g, edge.width=E(g)$weight)
I am not sure if I am setting the weights correctly, but I hope you find my code helpful.
You can find more helpful sample code here on edge weights and here on node weights.
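For what it's worth, here is a hedged sketch of deriving the weights from the data rather than guessing them (pair_counts is a name I made up; this would replace the c(1, grp) line above):
# tally how often each pair occurs; the counts become the edge weights
pair_counts <- aggregate(list(weight = rep(1, nrow(pairs))),
                         by = data.frame(pairs), FUN = sum)
# columns beyond the first two become edge attributes, including weight
g <- graph.data.frame(pair_counts, directed = FALSE)
plot.igraph(g, edge.width = E(g)$weight * 3)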
Here is a solution that creates the bipartite graph, and then projects it to the individuals.
g <- graph.edgelist(cbind(grp, ind), directed=FALSE)
V(g)$type <- V(g)$name %in% grp
ind_g <- bipartite.projection(g)[[1]]
E(ind_g)$width <- E(ind_g)$weight * 3
plot(ind_g)
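The weight attribute created by the projection is the number of groups each pair shares, which is exactly the co-occurrence count. A quick way to inspect it (assuming the code above has run):
# each row: the two individuals and how many groups they share
cbind(get.edgelist(ind_g), E(ind_g)$weight)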
Hi, I'm currently using a large observational dataset to estimate the average effect of a treatment. To balance the treatment and control groups, I matched individuals on a series of variables using the full_join command.
matched_sample <- full_join(case, control, by = matched_variables)
The matched sample ended up with many rows because some individuals were matched more than once. I documented the number of matches found for each individual. Here I present a simpler version:
case_id <- c("A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C", "C", "D", "D", "E", "F", "F")
num_controls_matched <- c(7, 7, 7, 7, 7, 7, 7, 3, 3, 3, 5, 5, 5, 5, 5, 2, 2, 1, 2, 2)
control_id <- c("a" , "b", "c", "d", "e", "f", "g", "a", "b", "e", "a", "b", "e", "f", "h", "a", "e", "a", "b", "e")
num_cases_matched <- c(5, 4, 1, 1, 5, 2, 1, 5, 4, 5, 5, 4, 5, 2, 1, 5, 5, 5, 4, 5)
case_id num_controls_matched control_id num_cases_matched
1 A 7 a 5
2 A 7 b 4
3 A 7 c 1
4 A 7 d 1
5 A 7 e 5
6 A 7 f 2
7 A 7 g 1
8 B 3 a 5
9 B 3 b 4
10 B 3 e 5
11 C 5 a 5
12 C 5 b 4
13 C 5 e 5
14 C 5 f 2
15 C 5 h 1
16 D 2 a 5
17 D 2 e 5
18 E 1 a 5
19 F 2 b 4
20 F 2 e 5
where case_id and control_id are IDs of those from the treatment and the control groups, num_controls_matched is the number of matches found for the treated individuals, and num_cases_matched is the number of matches found for individuals in the control group.
I would like to keep as many treated individuals in the sample as possible. I would also like to prioritise the matches for the "less popular" individuals. For example, the treated individual E was only matched to 1 control, so the match E-a should be prioritised. Then, both D and F have 2 matches. Because b has only 4 matches whilst a and e both have 5 matches, F-b should be prioritised. Therefore, D can only be matched with e. The next one should be B because it has 3 matches. However, since a, b and e have already been matched with D, E and F, B has no match (NA). C is matched with h because h has only 1 match. A can be matched with c, d, or g.
I would like to construct a data frame to indicate the final 1:1 matches:
case_id control_id
A g
B NA
C h
D e
E a
F b
The original dataset includes more than 2,000 individuals, and some individuals have more than 30 matches. Due to the characteristics of some matching variables, propensity score matching is not what I am looking for. I would be really grateful for your help on this.
library(dplyr)

fun <- function(df, i = 1){
  # rows where either side has exactly i candidate matches
  a <- df %>%
    filter(num_controls_matched == i | num_cases_matched == i)
  # rows not involving any case or control already captured in a
  b <- df %>%
    filter(!(case_id %in% a$case_id | control_id %in% a$control_id))
  # raise the threshold until no remaining case has multiple rows
  if (any(table(b$case_id) > 1)) fun(df, i + 1)
  else rbind(a, b)[c('case_id', 'control_id')]
}
fun(df)
case_id control_id
1 A a
2 B b
3 C c
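For comparison, here is a more literal, hedged sketch of the priority rule described in the question: walk the cases from fewest to most candidate controls and, at each step, give the case its least-matched remaining control. greedy_match is a name I made up, and ties are broken by row order:
# assemble the example data from the question's vectors
df <- data.frame(case_id, num_controls_matched, control_id,
                 num_cases_matched, stringsAsFactors = FALSE)

greedy_match <- function(df) {
  # cases ordered from fewest to most candidate controls
  ord  <- unique(df$case_id[order(df$num_controls_matched)])
  used <- character(0)
  out <- lapply(ord, function(cs) {
    # candidate rows for this case whose control is still free
    cand <- df[df$case_id == cs & !df$control_id %in% used, ]
    if (nrow(cand) == 0)
      return(data.frame(case_id = cs, control_id = NA,
                        stringsAsFactors = FALSE))
    # among the remaining controls, prefer the least-matched one
    pick <- cand$control_id[which.min(cand$num_cases_matched)]
    used <<- c(used, pick)
    data.frame(case_id = cs, control_id = pick,
               stringsAsFactors = FALSE)
  })
  res <- do.call(rbind, out)
  res[order(res$case_id), ]
}

greedy_match(df)
On the example data this yields E-a, D-e, F-b, B-NA, C-h and A-c, with c being one of the acceptable c/d/g options for A described in the question.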
I am working with the R programming language. Suppose I have the following data frame:
a = rnorm(100,10,1)
b = rnorm(100,10,5)
c = rnorm(100,10,10)
my_data_2 = data.frame(a,b,c)
my_data_2$group = as.factor(C)
My Question: Suppose I want to add an ID column to this data frame that labels the first observation as "101" and increases the ID by 1 for each new row. I tried to do this as follows:
my_data_2$id = seq(101, 200, by = 1)
However, this "corrupted" the data frame:
head(my_data_2)
a b c
1 10.381397 9.534634 12.8330946
2 10.326785 6.397006 8.1217063
3 8.333354 11.474064 11.6035562
4 9.583789 12.096404 18.2764387
5 9.581740 12.302016 4.0601871
6 11.772943 9.151642 -0.3686874
group
1 c(9.98552413605153, 9.53807731118048, 6.92589246998173, 8.97095368638206, 9.70249918748529, 10.6161773148626, 9.2514231659343, 10.6566757899233, 10.2351848084123, 9.45970725813352, 9.15347719257448, 9.30428244749624, 8.43075784609759, 11.1200169905262, 11.3493313166827, 8.86895968334901, 9.13208319045466, 9.70062759133717)
2 c(8.90358954387628, 13.8756093430144, 12.9970566311467, 10.4227745183785, 21.3259516051226, 4.88590162247496, 10.260282181, 14.092109840631, 7.37839577680487, 9.09764173775965, 15.1636139760987, 9.9773055885761, 8.29361737323061, 8.61361852648607, 12.6807897406641, 0.00863359720839085, 10.7660528147358, 9.79616528370632)
3 c(25.8063583646201, -11.5722310383483, 8.56096791164312, 12.2858029391835, -0.312392781809937, 0.946343715084028, 2.45881422753051, 7.26197515743391, 0.333766891336273, 14.9149659649045, -4.55483090530928, -19.8075232688082, 16.59106194569, 18.7377329188129, 1.1771203751127, -6.19019973790205, -5.02277721344565, 23.3363430334739)
4 c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3)
5 c("B", "B", "B", "A", "B", "B", "B", "B", "B", "B", "B", "A", "B", "B", "B", "B", "B", "B")
6 c("B", "B", "B", "B", "B", "A", "B", "B", "A", "B", "B", "B", "B", "B", "B", "B", "B", "B")
id
1 101
2 102
3 103
4 104
5 105
6 106
Can someone please show me how to fix this problem?
Thanks!
Your problem isn't your ID column; your problem is where you define your group variable. You call as.factor(C) (note the uppercase C), but the column of your data frame is a lowercase c. So I guess you have defined another object C outside of your data frame, and that is what now "corrupts" your data frame.
You probably want to do:
my_data_2$group <- as.factor(my_data_2$c)
I was able to figure out the answer!
a = rnorm(100,10,1)
b = rnorm(100,10,5)
c = rnorm(100,10,10)
my_data_2 = data.frame(a,b,c)
my_data_2$group = as.factor("C")
my_data_2$id = seq(101, 200, by = 1)
head(my_data_2)
a b c group id
1 9.436773 10.712568 3.7699748 C 101
2 10.265810 3.408589 11.9230024 C 102
3 10.503245 12.197000 8.3620889 C 103
4 9.279878 7.007812 16.8268852 C 104
5 10.683518 8.039032 5.2287997 C 105
6 11.097258 10.313103 0.4988398 C 106
I have the following data frame, describing conditions each patient has (each can have more than 1):
df <- structure(list(patient = c(1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6,
6, 7, 7, 8, 8, 9, 9, 10), condition = c("A", "A", "B", "B", "D",
"C", "A", "C", "C", "B", "D", "B", "A", "A", "C", "B", "C", "D",
"C", "D")), row.names = c(NA, -20L), class = c("tbl_df", "tbl",
"data.frame"))
I would like to create a "confusion matrix", which in this case will be a 4x4 matrix where AxA will have the value 5 (5 patients have condition A), AxB will have the value 2 (two patients have A and B), and so on.
How can I achieve this?
You can join the table with itself and tabulate the resulting condition pairs.
library(dplyr)
# pair every condition of a patient with every other one (including itself)
df2 <- inner_join(df, df, by = "patient")
table(df2$condition.x, df2$condition.y)
A B C D
A 5 2 2 1
B 2 5 3 2
C 2 3 6 2
D 1 2 2 4
Here is a base R answer using outer -
# number of patients having both condition x and condition y
count_patient <- function(x, y) {
  length(intersect(df$patient[df$condition == x],
                   df$patient[df$condition == y]))
}
vec <- sort(unique(df$condition))
# Vectorize() lets outer() apply the scalar function elementwise
res <- outer(vec, vec, Vectorize(count_patient))
dimnames(res) <- list(vec, vec)
res
# A B C D
#A 5 2 2 1
#B 2 5 3 2
#C 2 3 6 2
#D 1 2 2 4
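For completeness, a very compact base R alternative is crossprod() on the patient-by-condition incidence table (a sketch; m is just a scratch name). Since each patient/condition pair occurs at most once in df, the cross product counts shared patients directly:
# rows = patients, columns = conditions; t(m) %*% m counts, for each
# pair of conditions, the patients who have both
m <- table(df$patient, df$condition)
crossprod(m)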
I have a data table with a number of observations; each has a non-unique ID and an integer X value.
Grouped by ID, I want the sum of X for each group to be larger than 10. To achieve this, each observation is allowed to change its ID as long as the sum of X for its original group does not go below 10.
Below is an example of what I mean and a very manual solution:
# sample data
library(data.table)
input <- data.table(ID = c("A", "A", "A", "B", "B", "B",
                           "C", "C", "C", "D", "D", "D"),
                    X = c(1, 3, 1, 5, 1, 5,
                          6, 10, 2, 3, 3, 4))
# summarise X by ID
input[, .(X = sum(X)), by = ID]
ID X
A 5
B 11
C 18
D 10
# what the output should look like
output <- data.table(ID = c("A", "A", "A", "B", "B", "B" ,
"A", "C", "C", "D", "D", "D"),
X = c(1, 3, 1, 5, 1, 5,
6, 10, 2, 3, 3, 4))
output[, .(X = sum(X)), by = ID]
ID X
A 11
B 11
C 12
D 10
output
ID X
A 1
A 3
A 1
B 5
B 1
B 5
A 6 - this observation changed ID from C to A to get group A to 11
C 10
C 2
D 3
D 3
D 4
This is obviously a very simple example, as only group C has enough spare capacity to donate observations. In reality there might be cases where multiple groups are able to give away observations and where more than one observation has to be modified; however, there will only ever be up to 4 groups between which movement can happen.
Is it possible to automate this process in R?
Here is a possible heuristic approach. First, subset to those IDs whose sums differ from 10 (say there are N of these IDs). Then distribute the values from these IDs, taken in descending order, always into whichever of the N buckets currently has the smallest sum.
library(data.table)

b <- 10  # target group sum
input <- data.table(ID = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D"),
                    X = c(1, 3, 1, 5, 1, 5, 6, 10, 2, 3, 3, 4))
#find the IDs that need redistribution
scramID <- input[, sum(X), by = ID][V1!=b, ID]
#extract the values for those IDs
x <- input[ID %in% scramID, sort(X, decreasing=TRUE)]
#create list of empty vectors as buckets
l <- replicate(length(scramID), vector('integer'))
#assign the values starting with largest to the bucket with the smallest sum
for (k in x) {
  i <- which.min(lapply(l, sum))
  l[[i]] <- c(l[[i]], k)
}
#rbind to original dataset to get result
names(l) <- scramID
ans <- rbindlist(list(
  setnames(setDT(stack(l)), c("X","ID")),
  input[!ID %in% scramID]
), use.names=TRUE)
output:
X ID
1: 10 A
2: 1 A
3: 1 A
4: 6 B
5: 3 B
6: 2 B
7: 5 C
8: 5 C
9: 1 C
10: 3 D
11: 3 D
12: 4 D
and the distribution of values:
> ans[, sum(X), by = ID]
ID V1
1: A 12
2: B 11
3: C 11
4: D 10
For this example, the IDs chosen for redistribution are A, B and C, so there are three buckets. The values for these IDs in descending order are 10, 6, 5, 5, 3, 2, 1, 1 and 1.
In the first 3 iterations, 10, 6 and 5 are assigned to the first, second and third buckets respectively.
The next number is also 5; it is assigned to the third bucket (the one already holding 5), since that bucket has the smallest sum (5) of the three.
The next number, 3, is assigned to the second bucket, since it now has the smallest sum (6) of the three.
And so on for the rest of the numbers.
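If you want to watch this happen, the loop from the answer can be re-run with a print added (a sketch, re-using x and scramID from the code above):
# replay the assignment, showing which bucket each value lands in
l <- replicate(length(scramID), vector('integer'))
for (k in x) {
  i <- which.min(sapply(l, sum))
  l[[i]] <- c(l[[i]], k)
  cat("placed", k, "in bucket", i, "| bucket sums:", sapply(l, sum), "\n")
}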
I have a data frame from which I want to delete duplicate rows, but I want to delete them only if a value from another column is the same for all rows in the group. (To be more clear: I want to delete the duplicated rows which have the same "Number" value for all rows of a given "Name".)
Here is an example of my data frame:
df <- data.frame("Name" = c("a", "a", "b", "b", "b", "c", "c", "c"),
"Number" = c(1, 1, 1, 2, 3, 4, 5, 5), stringsAsFactors = FALSE)
And the result I expect is:
result <- data.frame("Name" = c("b", "b", "b", "c", "c", "c"),
"Number" = c(1, 2, 3, 4, 5, 5), stringsAsFactors = FALSE)
We can group_by Name and remove groups which have more than 1 row and have only one distinct value.
library(dplyr)
df %>%
group_by(Name) %>%
filter(!(n_distinct(Number) == 1 & n() > 1))
# Name Number
# <chr> <dbl>
#1 b 1
#2 b 2
#3 b 3
#4 c 4
#5 c 5
#6 c 5
and using base R ave, the same logic can be written as
df[with(df, !as.logical(ave(Number, Name, FUN = function(x)
  length(unique(x)) == 1 & length(x) > 1))), ]
Here is a solution with data.table
library("data.table")
df <- data.table("Name" = c("a", "a", "b", "b", "b"),
"Number" = c(1, 1, 2, 2, 3))
df[, if (uniqueN(Number)!=1 || .N==1) .SD, Name]
and here is a solution with base R:
df <- data.frame("Name" = c("a", "a", "b", "b", "b"),
"Number" = c(1, 1, 2, 2, 3), stringsAsFactors = FALSE)
df[as.logical(ave(df$Number, df$Name, FUN=function(x) length(unique(x))!=1 || length(x)==1)),]
We can use data.table methods
library(data.table)
setDT(df)[, .SD[uniqueN(Number) > 1] , Name]
# Name Number
#1: b 1
#2: b 2
#3: b 3
#4: c 4
#5: c 5
#6: c 5