Create a co-occurrence matrix from a single column of observations - r

I have a series of data frames, each with an individual identifier (in this example a letter A-E), and the site number it was observed at.
In this example, I have 3 data frames:
Letters<-c("A","B","C","D","E")
Site1<-c(1,1,2,2,2)
Site2<-c(10,10,20,30,30)
Site3<-c(17,27,37,47,57)
Df1<-data.frame(Letters, Site1)
Df2<-data.frame(Letters, Site2)
Df3<-data.frame(Letters, Site3)
For the first one, it ends up looking like this:
Df1
Letters Site1
1 A 1
2 B 1
3 C 2
4 D 2
5 E 2
Individuals A and B were found at site 1, and individuals C, D, and E were found at site 2.
I'm looking for a way to track which individuals are found within the same sites within a single dataframe (note the site numbers change each time, so I only care about within-dataframe groupings).
I assume I would create one co-occurrence matrix per data frame, each containing only 1s and 0s to indicate whether two individuals overlapped. The last step would then be to add the matrices together, like so:
DF1 co-occurrence
A B C D E
A 1 1 0 0 0
B 1 1 0 0 0
C 0 0 1 1 1
D 0 0 1 1 1
E 0 0 1 1 1
DF2 co-occurrence
A B C D E
A 1 1 0 0 0
B 1 1 0 0 0
C 0 0 1 0 0
D 0 0 0 1 1
E 0 0 0 1 1
DF3 co-occurrence
A B C D E
A 1 0 0 0 0
B 0 1 0 0 0
C 0 0 1 0 0
D 0 0 0 1 0
E 0 0 0 0 1
And then add them up to see who is most often grouped with whom:
A B C D E
A 3 2 0 0 0
B 2 3 0 0 0
C 0 0 3 1 1
D 0 0 1 3 2
E 0 0 1 2 3
But I'm not sure how to implement this kind of workflow in R, or whether this is even the best way to approach the problem. My hope is to end up with a matrix similar to the last one above, or some comparable way to quantify total co-occurrence.
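The workflow described in the question can be sketched in base R as follows. This is a minimal sketch, not a definitive answer: it assumes every data frame lists the same individuals once each and in the same order, and the helper name cooc_one and the list name sites are made up for illustration.

```r
Letters <- c("A", "B", "C", "D", "E")

# The three site columns from the question, kept in a list so they can be looped over.
sites <- list(
  Site1 = c(1, 1, 2, 2, 2),
  Site2 = c(10, 10, 20, 30, 30),
  Site3 = c(17, 27, 37, 47, 57)
)

# Hypothetical helper: entry [i, j] is 1 when individuals i and j share a site, else 0.
cooc_one <- function(site, ids) {
  m <- outer(site, site, "==") * 1L   # logical comparison matrix turned into 0/1
  dimnames(m) <- list(ids, ids)
  m
}

# One 0/1 matrix per data frame, then an element-wise sum over all of them.
cooc_list <- lapply(sites, cooc_one, ids = Letters)
total <- Reduce(`+`, cooc_list)
total
```

If an individual can appear more than once per data frame, or the row order differs between data frames, this comparison by position no longer applies and the rows would need to be aligned by identifier first.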

Related

R: Squared contingency table [duplicate]

I want to make a contingency table of observations against their predictions from a neural network. Since I want the correct predictions to sit on the diagonal, I would like the table to be square, even if some rows contain only 0's. That is, I would like to have
b
a a b c d e f g
a 1 0 1 0 2 1 0
b 0 0 0 0 0 0 0
c 0 0 0 0 0 0 0
d 2 3 1 2 2 3 2
e 1 2 1 1 0 1 3
f 0 0 0 0 0 0 0
g 4 2 1 0 3 1 0
Instead of:
> set.seed(1)
> b<-sample(letters[1:7],40,rep=TRUE)
> a<-sample(letters[1:4],40,rep=TRUE)
>
> table(a,b)
b
a a b c d e f g
a 1 0 1 0 2 1 0
d 2 3 1 2 2 3 2
e 1 2 1 1 0 1 3
g 4 2 1 0 3 1 0
How can I do this?
Convert a and b to factors whose levels are the union of both:
tmp <- sort(union(a, b))
table(factor(a, levels = tmp), factor(b, levels = tmp))
# a b c d e f g
# a 0 1 1 2 2 1 4
# b 2 1 1 1 2 3 2
# c 4 0 1 2 0 1 1
# d 0 1 1 1 3 1 1
# e 0 0 0 0 0 0 0
# f 0 0 0 0 0 0 0
# g 0 0 0 0 0 0 0
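Because sample() output depends on the R version, a small deterministic sketch may make the effect of the shared levels easier to check. The vectors obs and pred below are hypothetical data invented for illustration; they are not from the question.

```r
# Hypothetical observed and predicted labels (not the question's data).
obs  <- c("a", "a", "b", "d", "d")
pred <- c("a", "c", "b", "d", "g")

# Shared level set, so both dimensions of the table cover the same labels.
lev <- sort(union(obs, pred))
tab <- table(obs = factor(obs, levels = lev), pred = factor(pred, levels = lev))

dim(tab)                              # same number of rows and columns
accuracy <- sum(diag(tab)) / sum(tab) # share of cases on the diagonal
```

With a square table, diag(tab) lines up observed and predicted labels, which is what makes the accuracy calculation on the last line meaningful.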

Measure weight of communities for different subgraphs

I detect communities in my adjacency matrix. In parallel, I create an affiliation matrix using the vertices of the same matrix. How do I measure the weight of the communities in each of the columns of the affiliation matrix?
Take the following adjacency matrix:
A B C D E F G
A 0 1 0 1 0 1 0
B 1 0 1 1 0 1 0
C 0 1 0 0 0 0 0
D 1 1 0 0 1 1 0
E 0 0 0 1 0 1 0
F 1 1 0 1 1 0 1
G 0 0 0 0 0 1 0
I identify the communities:
com <- edge.betweenness.community(g)
V(g)$memb <- com$membership
Now take the following affiliation matrix:
P R Q
A 1 1 0
B 1 0 1
C 1 1 0
D 0 1 0
E 1 0 1
F 0 0 1
G 1 1 0
How do I count the number of vertices in community [[1]] that are affiliated with "P" in the affiliation matrix?
You can use sum(m[com[[1]], "P"] > 0), assuming that m holds your affiliation matrix, or lapply(com, function(x) colSums(m[x, ])) for all communities.
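The same count can be sketched in base R without igraph. The membership vector memb below is hypothetical, made up for illustration, since the question does not show the detected communities; in practice a vector like it could come from membership(com).

```r
# Affiliation matrix from the question.
m <- matrix(
  c(1, 1, 0,
    1, 0, 1,
    1, 1, 0,
    0, 1, 0,
    1, 0, 1,
    0, 0, 1,
    1, 1, 0),
  ncol = 3, byrow = TRUE,
  dimnames = list(LETTERS[1:7], c("P", "R", "Q"))
)

# HYPOTHETICAL community membership per vertex (not the question's actual result).
memb <- c(A = 1, B = 1, C = 2, D = 1, E = 3, F = 1, G = 3)

# Vertices of community 1 that are affiliated with "P".
n_P_in_1 <- sum(m[names(memb)[memb == 1], "P"] > 0)

# Counts for every community and every affiliation column at once:
# rowsum() adds up the rows of m within each membership group.
per_community <- rowsum(m, group = memb[rownames(m)])
per_community
```

Since m only holds 0s and 1s, the group-wise sums are the same as counts of affiliated vertices.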

Match combinations of row values between 2 different data frames

I have a data.frame with 16 different combinations of 4 different cell markers
combinations_df
FITC Cy3 TX_RED Cy5
a 0 0 0 0
b 1 0 0 0
c 0 1 0 0
d 1 1 0 0
e 0 0 1 0
f 1 0 1 0
g 0 1 1 0
h 1 1 1 0
i 0 0 0 1
j 1 0 0 1
k 0 1 0 1
l 1 1 0 1
m 0 0 1 1
n 1 0 1 1
o 0 1 1 1
p 1 1 1 1
I have my "main" data.frame with 10 columns and thousands of rows.
> main_df
a b FITC d Cy3 f TX_RED h Cy5 j
1 0 1 1 1 1 0 1 1 1 1
2 0 1 0 1 1 0 1 0 1 1
3 1 1 0 0 0 1 1 0 0 0
4 0 1 1 1 1 0 1 1 1 1
5 0 0 0 0 0 0 0 0 0 0
....
I want to use all the possible 16 combinations from combinations_df to compare with each row of main_df. Then I want to create a new vector to later cbind to main_df as column 11.
sample output
> phenotype
[1] "g" "i" "a" "p" "g"
I thought about doing a while loop within a for loop checking each combinations_df row through each main_df row.
Sounds like it could work, but I have close to 1 000 000 rows in main_df, so I wanted to see if anybody had a better idea.
EDIT: I forgot to mention that I want to compare combinations_df only to columns 3,5,7,9 from main_df. They have the same name, but it might not be that obvious.
EDIT: Changing the sample output, since no "t" should be present
The dplyr solution is outrageously simple. First you need to put phenotype in combinations_df as an explicit variable like this:
# phenotype FITC Cy3 TX_RED Cy5
#1 a 0 0 0 0
#2 b 1 0 0 0
#3 c 0 1 0 0
#4 d 1 1 0 0
# etc
dplyr lets you join on multiple variables, so from here it's a one-liner to look up the phenotypes.
library(dplyr)
left_join(main_df, combinations_df, by=c("FITC", "Cy3", "TX_RED", "Cy5"))
# a b FITC d Cy3 f TX_RED h Cy5 j phenotype
#1 0 1 1 1 1 0 1 1 1 1 p
#2 0 1 0 1 1 0 1 0 1 1 o
#3 1 1 0 0 0 1 1 0 0 0 e
#4 0 1 1 1 1 0 1 1 1 1 p
#5 0 0 0 0 0 0 0 0 0 0 a
I originally thought you'd have to concatenate columns with tidyr::unite but this was not the case.
It's not very elegant, but this method works just fine. There are no nested loops here, so it should run reasonably quickly. You might try matching on the data frame rows directly and do away with the loops altogether, but this was the fastest way I could figure out. You might also look at the plyr or data.table packages, which are very powerful for this kind of thing.
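# Build one text key per row of main_df from columns 3, 5, 7 and 9
main_text <- character(nrow(main_df))
for (i in seq_len(nrow(main_df))) {
  main_text[i] <- paste(main_df[i, 3], main_df[i, 5], main_df[i, 7], main_df[i, 9], sep = "")
}
# Build the same kind of key for each of the 16 combinations
comb_text <- character(nrow(combinations_df))
for (i in seq_len(nrow(combinations_df))) {
  comb_text[i] <- paste(combinations_df[i, 1], combinations_df[i, 2], combinations_df[i, 3], combinations_df[i, 4], sep = "")
}
# Look up each main_df key among the combination keys and return the matching row name
rownames(combinations_df)[match(main_text, comb_text)]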
How about something like this? My results differ from yours because there is no "t" in combination_df. You could do it without assigning a new column if you wanted; this is mainly for illustrative purposes.
combination_df <- read.table("Documents/comb.txt.txt", header=T)
main_df <- read.table("Documents/main.txt", header=T)
main_df
combination_df
main_df$key <- do.call(paste0, main_df[,c(3,5,7,9)])
combination_df$key <- do.call(paste0, combination_df)
rownames(combination_df)[match(main_df$key, combination_df$key)]
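For readers without the input files, the key-matching approach can be reproduced with data built in memory. This is a sketch under assumptions: the combinations table is generated with expand.grid (which happens to reproduce the a to p ordering shown in the question), only the five main_df rows shown in the question are used, and the helper make_key is a made-up name.

```r
# All 16 marker combinations; expand.grid varies the first column fastest,
# which matches the a..p ordering in the question.
combinations_df <- expand.grid(FITC = 0:1, Cy3 = 0:1, TX_RED = 0:1, Cy5 = 0:1)
rownames(combinations_df) <- letters[1:16]

# The five sample rows of main_df shown in the question.
main_df <- data.frame(
  a = c(0, 0, 1, 0, 0), b = c(1, 1, 1, 1, 0), FITC = c(1, 0, 0, 1, 0),
  d = c(1, 1, 0, 1, 0), Cy3 = c(1, 1, 0, 1, 0), f = c(0, 0, 1, 0, 0),
  TX_RED = c(1, 1, 1, 1, 0), h = c(1, 0, 0, 1, 0), Cy5 = c(1, 1, 0, 1, 0),
  j = c(1, 1, 0, 1, 0)
)

markers <- c("FITC", "Cy3", "TX_RED", "Cy5")

# Hypothetical helper: paste the marker columns of each row into one string key.
make_key <- function(df, cols) do.call(paste0, unname(as.list(df[cols])))

# Vectorised lookup: no explicit loop over rows.
phenotype <- rownames(combinations_df)[
  match(make_key(main_df, markers), make_key(combinations_df, markers))
]
phenotype
```

Selecting the marker columns by name rather than by position (3, 5, 7, 9) keeps the lookup working if the column order of main_df changes.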

calculating sum of all against all columns with matching row count

I have a df with several columns having values 0 or 1. Something like:
a b c d e
1 0 0 0 0
0 1 0 1 0
0 1 0 1 0
1 0 1 0 1
I would like to create a 5 by 5 matrix showing, for each pair of columns, the number of rows in which both columns are 1. Only 1's count, so each diagonal entry is simply the number of rows in which that column is 1. The output should look something like:
a b c d e
a 2 0 1 0 1
b 0 2 0 2 0
c 1 0 1 0 1
d 0 2 0 2 0
e 1 0 1 0 1
Thanks.
Sudhir
Convert to matrix and take cross product:
m <- as.matrix(d)
crossprod(m,m)
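As a check, the cross product can be run on the four rows from the question. The sketch below assumes the data frame is named d, as in the answer; the result name counts is made up. Entry [i, j] of t(m) %*% m sums m[, i] * m[, j] over rows, which for 0/1 data counts the rows where both columns are 1.

```r
# The 0/1 data frame from the question.
d <- data.frame(
  a = c(1, 0, 0, 1),
  b = c(0, 1, 1, 0),
  c = c(0, 0, 0, 1),
  d = c(0, 1, 1, 0),
  e = c(0, 0, 0, 1)
)

m <- as.matrix(d)

# crossprod(m) computes t(m) %*% m; the second argument defaults to m itself.
counts <- crossprod(m)
counts
```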

Delete columns from a square matrix that sum to zero along with corresponding rows

I have a binary transition matrix. I want to delete rows associated with columns that sum to zero. For example, if
A B C D E
A 0 0 0 1 0
B 1 0 0 1 0
C 0 0 1 1 0
D 0 0 1 0 0
E 0 0 1 1 0
columns B and E sum to zero. I know how to get rid of the columns like this,
> a.adj=a[,!!colSums(a)]
> a.adj
A C D
A 0 0 1
B 1 0 1
C 0 1 1
D 0 1 0
E 0 1 1
but how can I at the same time delete rows B and E to get
A C D
A 0 0 1
C 0 1 1
D 0 1 0
If the rownames and colnames are in the same order:
indx <- !!colSums(a)
a[indx,indx]
# A C D
#A 0 0 1
#C 0 1 1
#D 0 1 0
Use names to select both columns and rows:
> ind <- colnames(a[,!!colSums(a)])
> a[ind, ind]
A C D
A 0 0 1
C 0 1 1
D 0 1 0
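A self-contained sketch of the logical-index approach on the matrix from the question is given below. The names keep and a_adj are made up, and drop = FALSE is an addition that is not in the answers above: it keeps the result a matrix even when only one column survives.

```r
# Binary transition matrix from the question.
a <- matrix(
  c(0, 0, 0, 1, 0,
    1, 0, 0, 1, 0,
    0, 0, 1, 1, 0,
    0, 0, 1, 0, 0,
    0, 0, 1, 1, 0),
  nrow = 5, byrow = TRUE,
  dimnames = list(LETTERS[1:5], LETTERS[1:5])
)

# TRUE for columns with a non-zero sum; same idea as !!colSums(a), spelled out.
keep <- colSums(a) != 0

# Apply the same index to rows and columns (assumes matching row/column order).
a_adj <- a[keep, keep, drop = FALSE]
a_adj
```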
