How to quantify observation pairs in individuals - r

I'm looking for a way to count pairs of observations that co-occur within individuals (patients). In this example, each patient has had two different diseases. The disease pair "a" and "b" occurs in the same patient 4 times (patients "G", "H", "I" and "J"), and the pair "k" and "o" occurs twice (patients "D" and "E" each had both "k" and "o").
Patient_ID <- c("A","A","B","B","C","C","D","D","E","E","F","F",
                "G","G","H","H","I","I","J","J")
Disease <- c("v","s","s","v","s","v","k","o","k","o","o","s","a","b",
             "a","b","b","a","b","a")
DATA <- data.frame(Patient_ID, Disease)
print(DATA)
Patient_ID Disease
1 A v
2 A s
3 B s
4 B v
5 C s
6 C v
7 D k
8 D o
9 E k
10 E o
11 F o
12 F s
13 G a
14 G b
15 H a
16 H b
17 I b
18 I a
19 J b
20 J a
From these data I would like to generate a table like the one below.
  a b k o v s
a 0 4 0 0 0 0
b 4 0 0 0 0 0
k 0 0 0 2 0 0
o 0 0 2 0 0 1
v 0 0 0 0 0 3
s 0 0 0 1 3 0
Then I would like to generate a table containing only the levels that have a count above a certain threshold (for example 2), as in the second table below.
  a b v s
a 0 4 0 0
b 4 0 0 0
v 0 0 0 3
s 0 0 3 0

Here is a base R option using table() and crossprod():
res <- `diag<-`(crossprod(table(DATA)),0)
which gives
> res
       Disease
Disease a b k o s v
      a 0 4 0 0 0 0
      b 4 0 0 0 0 0
      k 0 0 0 2 0 0
      o 0 0 2 0 1 0
      s 0 0 0 1 0 3
      v 0 0 0 0 3 0
To keep only the levels with a count above a given threshold, you can use
th <- 2
inds <- rowSums(res > th)>0
subset_res <- subset(res,inds,inds)
which gives
> subset_res
       Disease
Disease a b s v
      a 0 4 0 0
      b 4 0 0 0
      s 0 0 0 3
      v 0 0 3 0
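To see why crossprod() produces the pair counts: table(DATA) is a patient-by-disease incidence matrix, and crossprod(X) computes t(X) %*% X. The sketch below spells that product out, assuming each patient/disease combination appears at most once; the names inc and co are illustrative and not part of the answer above.

```r
Patient_ID <- c("A","A","B","B","C","C","D","D","E","E","F","F",
                "G","G","H","H","I","I","J","J")
Disease <- c("v","s","s","v","s","v","k","o","k","o","o","s","a","b",
             "a","b","b","a","b","a")
DATA <- data.frame(Patient_ID, Disease)

# Patient-by-disease incidence matrix: 1 if the patient had the disease
inc <- table(DATA)

# Entry [i, j] sums inc[p, i] * inc[p, j] over patients p, i.e. the number
# of patients who had both disease i and disease j
co <- t(inc) %*% inc

# The diagonal holds the number of patients per disease, so zero it out
diag(co) <- 0

co["a", "b"]  # pair count for diseases "a" and "b"
co["k", "o"]  # pair count for diseases "k" and "o"
```

If a patient could be recorded with the same disease more than once, the incidence matrix would need to be capped at 1 first (for example with inc > 0) for these entries to remain patient counts.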

First, use unstack() to reshape Disease into a data frame with 2 columns, one row per patient. Make sure both columns share the same factor levels, so that table() keeps every disease in both dimensions. Then pass the data frame to table() to create a contingency table. In this table, "a & b" and "b & a" are counted separately, so the total counts are tab + t(tab).
pair <- data.frame(t(unstack(DATA, Disease ~ Patient_ID)))
# With R >= 4.0 defaults Disease is a character column and levels(DATA$Disease) is NULL,
# so the common levels are taken from the unique values instead
pair[] <- lapply(pair, factor, levels = sort(unique(as.character(DATA$Disease))))
tab <- table(pair)
tab + t(tab)
#    X2
# X1  a b k o s v
#   a 0 4 0 0 0 0
#   b 4 0 0 0 0 0
#   k 0 0 0 2 0 0
#   o 0 0 2 0 1 0
#   s 0 0 0 1 0 3
#   v 0 0 0 0 3 0

Related

Make an adjacency matrix in R

I want to make an adjacency matrix from a data frame (mydf) consisting of several rows, using the following rules:
List all letters as a square matrix.
For each source, count and sum its connections across the remaining columns (p1 p2 p3 p4 p5) of the corresponding rows. For example, b is connected with a 5 times (rows 2 and 8).
If a letter does not appear in source, its connection values should be zero.
The dataframe is:
mydf <- data.frame(p1=c('a','a','a','b','g','b','c','c','d'),
                   p2=c('b','c','d','c','d','e','d','e','e'),
                   p3=c('a','a','c','c','d','d','d','a','a'),
                   p4=c('a','a','b','c','c','e','d','a','b'),
                   p5=c('a','b','c','d','I','b','b','c','z'),
                   source=c('a','b','c','d','e','e','a','b','d'))
The adjacency matrix should be as follows:
  a b c d e g I z
a 4 2 1 3 0 0 0 0
b 5 1 3 0 1 0 0 0
c 1 1 2 1 0 0 0 0
d 1 2 3 2 1 0 0 1
e 0 2 1 3 2 1 1 0
g 0 0 0 0 0 0 0 0
I 0 0 0 0 0 0 0 0
z 0 0 0 0 0 0 0 0
I have hundreds of columns and thousands of rows, so I would appreciate the fastest way to do this in R.
In base R, we can use table():
vals <- unlist(mydf[-ncol(mydf)])
table(factor(rep(mydf$source, ncol(mydf) - 1), levels = unique(vals)), vals)
#    vals
#     a b c d e g I z
#   a 4 2 1 3 0 0 0 0
#   b 5 1 3 0 1 0 0 0
#   g 0 0 0 0 0 0 0 0
#   c 1 1 2 1 0 0 0 0
#   d 1 2 3 2 1 0 0 1
#   e 0 2 1 3 2 1 1 0
#   I 0 0 0 0 0 0 0 0
#   z 0 0 0 0 0 0 0 0
With the tidyverse, we can do:
library(dplyr)
library(tidyr)
mydf %>%
  pivot_longer(cols = -source) %>%
  count(source, value) %>%
  pivot_wider(names_from = value, values_from = n) %>%
  complete(source = names(.)[-1]) %>%
  # fill missing counts with 0 in the count columns only (source is character)
  mutate(across(-source, ~ replace_na(.x, 0L)))
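The base R output above lists its rows in order of first appearance, which differs from the column order. One way to get rows and columns in the same order is to give both factors the same levels. This is a minimal sketch of that idea; the names targets, lv and adj are illustrative, and the sort order of lv depends on the locale.

```r
mydf <- data.frame(p1 = c('a','a','a','b','g','b','c','c','d'),
                   p2 = c('b','c','d','c','d','e','d','e','e'),
                   p3 = c('a','a','c','c','d','d','d','a','a'),
                   p4 = c('a','a','b','c','c','e','d','a','b'),
                   p5 = c('a','b','c','d','I','b','b','c','z'),
                   source = c('a','b','c','d','e','e','a','b','d'),
                   stringsAsFactors = FALSE)

# All non-source columns flattened column by column
targets <- unlist(mydf[setdiff(names(mydf), "source")], use.names = FALSE)

# One shared level set, so the table is square with matching row/column order
lv <- sort(unique(c(targets, mydf$source)))

adj <- table(
  source = factor(rep(mydf$source, times = ncol(mydf) - 1), levels = lv),
  target = factor(targets, levels = lv)
)

adj["b", "a"]  # connections from source b to a (rows 2 and 8)
```

Because rep() repeats the whole source vector once per remaining column, it lines up with the column-wise order produced by unlist().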

Create an adjacency matrix from unbalanced trade flow data in R

I have a dataset of bilateral trade flows of dimension 84x244.
How can I expand the dataset into a 244x244 matrix while keeping the same order and names as the columns?
Non-symmetric matrix
For example the matrix resembles:
  A B C D
B 0 0 0 1
D 2 0 0 0
and it should look like
  A B C D
A 0 0 0 0
B 0 0 0 1
C 0 0 0 0
D 2 0 0 0
With A B C D as row and column names
Here are two methods that ensure the column names and row names are effectively the same, using a default value of 0 for missing rows/columns. These do not assume that the columns are always full; if this is guaranteed, then you can ignore the column-adding portions.
Both start with:
m <- as.matrix(read.table(header=TRUE, text="
A B C D
B 0 0 0 1
D 2 0 0 0"))
First
needrows <- setdiff(colnames(m), rownames(m))
m <- rbind(m, matrix(0, nrow=length(needrows), ncol=ncol(m),
                     dimnames=list(needrows, colnames(m))))
needcols <- setdiff(rownames(m), colnames(m))
m <- cbind(m, matrix(0, nrow=nrow(m), ncol=length(needcols),
                     dimnames=list(rownames(m), needcols)))
m
#   A B C D
# B 0 0 0 1
# D 2 0 0 0
# A 0 0 0 0
# C 0 0 0 0
To order the rows the same as the columns (note that any row names not present in the column names are dropped by this step, though you can include them with another setdiff if needed):
m[colnames(m),]
#   A B C D
# A 0 0 0 0
# B 0 0 0 1
# C 0 0 0 0
# D 2 0 0 0
Second
allnames <- sort(unique(unlist(dimnames(m))))
m2 <- matrix(0, nrow=length(allnames), ncol=length(allnames),
dimnames=list(allnames, allnames))
m2[intersect(rownames(m), allnames), colnames(m)] <-
m[intersect(rownames(m), allnames), colnames(m)]
m2[rownames(m), intersect(colnames(m), allnames)] <-
m[rownames(m), intersect(colnames(m), allnames)]
m2
#   A B C D
# A 0 0 0 0
# B 0 0 0 1
# C 0 0 0 0
# D 2 0 0 0
Here is a base R solution. The basic idea is to first construct a square matrix of all zeros whose row names are the same as its column names, and then assign values to the rows by row name:
M <- `dimnames<-`(matrix(0,nrow = ncol(m),ncol = ncol(m)),
replicate(2,list(colnames(m))))
M[rownames(m),] <- m
such that
> M
  A B C D
A 0 0 0 0
B 0 0 0 1
C 0 0 0 0
D 2 0 0 0
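The same idea can be wrapped in a small function so that it can be reused on the full 84x244 data. This is only a sketch: square_by_colnames is a hypothetical helper name, and it assumes every row name already appears among the column names, which it checks up front.

```r
# Hypothetical helper: pad `m` with zero rows so that rows match columns
square_by_colnames <- function(m, fill = 0) {
  stopifnot(!is.null(colnames(m)), !is.null(rownames(m)),
            all(rownames(m) %in% colnames(m)))
  out <- matrix(fill, nrow = ncol(m), ncol = ncol(m),
                dimnames = list(colnames(m), colnames(m)))
  out[rownames(m), ] <- m   # copy existing rows into place by name
  out
}

# Example matrix from the question (filled column by column)
m <- matrix(c(0, 2, 0, 0, 0, 0, 1, 0), nrow = 2,
            dimnames = list(c("B", "D"), c("A", "B", "C", "D")))

M <- square_by_colnames(m)
M
```

Rows are placed by name rather than by position, so the order of the rows in the input does not matter.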

Count all the letters (26) of one of the char variable in a dataframe

I have a dataframe with a few columns like this:
Attr  Description
60    asdfg asdg dfs
50    smlefekl dewld ewf
35    kojewdfhef e
All I need is to create 26 extra columns with the count of each letter in each row. I know I can use:
table(unlist(strsplit(mydata, ""), use.names=FALSE))
for a vector, but how can I update it for a dataframe?
If we use strsplit, we need to create a factor with levels specified as 'letters', so that letters with a zero count are kept:
d1 <- stack(setNames(strsplit(df1$Description, ""), seq_len(nrow(df1))))
d2 <- subset(d1, values != " ")
d2$values <- factor(d2$values, levels = letters)
t(table(d2))
#    values
# ind a b c d e f g h i j k l m n o p q r s t u v w x y z
#   1 2 0 0 3 0 2 2 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0
#   2 0 0 0 2 4 2 0 0 0 0 1 3 1 0 0 0 0 0 1 0 0 0 2 0 0 0
#   3 0 0 0 1 3 2 0 1 0 1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0
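Or, as shown in the comments, use str_count from stringr, looping through 'letters' to get the count of each letter for every row of 'Description'. sapply() already returns one row per data frame row and one column per letter, so no transpose is needed to get the 26 columns:
library(stringr)
sapply(letters, function(x) str_count(df1$Description, x))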
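To actually attach the counts to the data frame as 26 extra columns, the per-letter counts can be built as a list and bound on with cbind(). The sketch below uses only base R; df1 is re-created from the question, and count_letter is a hypothetical helper name.

```r
df1 <- data.frame(Attr = c(60, 50, 35),
                  Description = c("asdfg asdg dfs",
                                  "smlefekl dewld ewf",
                                  "kojewdfhef e"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: occurrences of the single character `ch` in each string
count_letter <- function(ch, x) {
  lengths(regmatches(x, gregexpr(ch, x, fixed = TRUE)))
}

# Named list with one integer vector per letter, converted to 26 columns
counts <- as.data.frame(lapply(setNames(letters, letters), count_letter,
                               x = df1$Description))

result <- cbind(df1, counts)
result[, c("Attr", "a", "d", "e", "s")]
```

Going through a named list keeps the result a data frame with 26 columns even when df1 has a single row, where sapply() would simplify to a plain vector.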

Measure weight of communities for different subgraphs

I detect communities in my adjacency matrix. In parallel, I create an affiliation matrix using the vertices of the same matrix. How do I measure the weight of the communities in each column of the affiliation matrix?
Take the following adjacency matrix:
  A B C D E F G
A 0 1 0 1 0 1 0
B 1 0 1 1 0 1 0
C 0 1 0 0 0 0 0
D 1 1 0 0 1 1 0
E 0 0 0 1 0 1 0
F 1 1 0 1 1 0 1
G 0 0 0 0 0 1 0
I identify the communities:
com <- edge.betweenness.community(g)
V(g)$memb <- com$membership
Now take the following affiliation matrix:
  P R Q
A 1 1 0
B 1 0 1
C 1 1 0
D 0 1 0
E 1 0 1
F 0 0 1
G 1 1 0
How do I count the number of vertices belonging to community [[1]] that are affiliated with "P" in the affiliation matrix?
You can do sum(m[com[[1]],"P"]>0), given that m holds your affiliation matrix. Or lapply(com, function(x) colSums(m[x, ])) for all communities.
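The per-community column sums can also be computed in one step from a membership vector with base R's rowsum(). The sketch below does not run igraph: the memb vector is a made-up assignment used purely for illustration (with igraph, membership(com) would supply the real one), and m is the affiliation matrix from the question.

```r
# Affiliation matrix from the question
m <- matrix(c(1, 1, 0,
              1, 0, 1,
              1, 1, 0,
              0, 1, 0,
              1, 0, 1,
              0, 0, 1,
              1, 1, 0),
            ncol = 3, byrow = TRUE,
            dimnames = list(LETTERS[1:7], c("P", "R", "Q")))

# ASSUMPTION: hypothetical community membership, not computed from the graph
memb <- c(A = 1, B = 1, C = 2, D = 1, E = 3, F = 1, G = 3)

# One row per community, one column per affiliation:
# entry [k, "P"] is the number of vertices in community k affiliated with "P"
weights <- rowsum(m, group = memb[rownames(m)])
weights
```

Indexing memb by rownames(m) keeps the vertices aligned even if the two objects list them in a different order.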

How could I calculate the sparsity of a data.frame in R?

I have a data.frame structured like this:
  A B C D E
F 1 0 7 0 0
G 0 0 0 1 1
H 1 1 0 0 0
I 1 2 1 0 0
L 1 0 0 0 0
and I want to calculate the sparsity (i.e. the percentage of 0 values) of this data.frame.
How could I do that?
sum(df == 0)/(dim(df)[1]*dim(df)[2])
[1] 0.6
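An equivalent and slightly shorter form takes the mean of the logical matrix, since the mean of TRUE/FALSE values is the fraction of TRUEs. The sketch below re-creates the data frame from the question and assumes all columns are numeric with no NA values.

```r
df <- data.frame(A = c(1, 0, 1, 1, 1),
                 B = c(0, 0, 1, 2, 0),
                 C = c(7, 0, 0, 1, 0),
                 D = c(0, 1, 0, 0, 0),
                 E = c(0, 1, 0, 0, 0),
                 row.names = c("F", "G", "H", "I", "L"))

# Fraction of cells equal to zero
sparsity <- mean(as.matrix(df) == 0)

# Same value, written as zeros divided by the total number of cells
sparsity_explicit <- sum(df == 0) / (nrow(df) * ncol(df))

sparsity
```

If the data may contain NA values, mean(as.matrix(df) == 0, na.rm = TRUE) would ignore them; whether they should count as zeros is a modelling decision.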
