Working with a phylogenetic tree in R, I would like to create a matrix which indicates if each branch of the tree (B1 to B8) is associated with each species (A to E), where 1s indicate that the branch is associated. (Shown below)
The R function which.edge() is useful for identifying the terminal branch for a species, but it doesn't identify ALL the branches associated with each species. What function could I use to identify all the branches in the tree that go from the root to the tip for each species?
Example Tree
library(ape)
ex.tree <- read.tree(text="(A:4,((B:1,C:1):2,(D:2,E:2):1):1);")
plot(ex.tree)
edgelabels() #shows branches 1-8
This is the matrix I would like to create (Species A-E as columns, Branches B1-B8 as rows), but with an easy function rather than by hand.
B1 <- c(1,0,0,0,0)
B2 <- c(0,1,1,1,1)
B3 <- c(0,1,1,0,0)
B4 <- c(0,1,0,0,0)
B5 <- c(0,0,1,0,0)
B6 <- c(0,0,0,1,1)
B7 <- c(0,0,0,1,0)
B8 <- c(0,0,0,0,1)
Mat <- rbind(B1,B2,B3,B4,B5,B6,B7,B8)
colnames(Mat) <- c("A","B","C","D","E")
Mat
For example, Branch B2 goes to species B-E, but not to species A. For Species E, branches B2, B6, B8 are present.
Which R function(s) would be best? Thanks in advance!
I am unaware of any built-in function that does this. I wrote a helper function that can calculate this from the edge data stored in the tree object.
branchNodeAdjacency <- function(x) {
  nt <- length(x$tip.label)          # number of tips (species)
  m <- matrix(0, ncol = nt, nrow = nrow(x$edge))
  from <- x$edge[, 1]
  to <- x$edge[, 2]
  g <- seq_along(x$tip.label)        # start at the tip node IDs
  while (any(!is.na(g))) {
    i <- match(g, to)                # edge leading into each current node (NA once past the root)
    m[cbind(i, seq_along(i))] <- 1   # mark that edge for the corresponding tip
    g <- from[i]                     # step one node closer to the root
  }
  rownames(m) <- paste0("B", seq.int(nrow(m)))
  colnames(m) <- x$tip.label
  m
}
branchNodeAdjacency(ex.tree)
# A B C D E
# B1 1 0 0 0 0
# B2 0 1 1 1 1
# B3 0 1 1 0 0
# B4 0 1 0 0 0
# B5 0 0 1 0 0
# B6 0 0 0 1 1
# B7 0 0 0 1 0
# B8 0 0 0 0 1
The idea is to walk from each tip back toward the root, marking every branch crossed along the way; in effect we keep track of which tips lie below each branch.
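A shorter alternative sketch (my addition, not the approach above) uses ape's extract.clade(): a branch is associated with a species exactly when that species descends from the node at the branch's lower end, and extract.clade() returns those descendants for internal nodes.

# Hedged alternative: mark, for each branch, the tips descending from its child node.
library(ape)
ex.tree <- read.tree(text = "(A:4,((B:1,C:1):2,(D:2,E:2):1):1);")
ntip <- length(ex.tree$tip.label)
M <- t(sapply(ex.tree$edge[, 2], function(node) {
  if (node <= ntip) {
    tips <- ex.tree$tip.label[node]                   # terminal branch: just that tip
  } else {
    tips <- extract.clade(ex.tree, node)$tip.label    # internal branch: all tips below it
  }
  as.integer(ex.tree$tip.label %in% tips)
}))
dimnames(M) <- list(paste0("B", seq_len(nrow(M))), ex.tree$tip.label)
M   # should match branchNodeAdjacency(ex.tree)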
I have a number of ordered lists (or sequences, or vectors, or data table columns) 1, 2, 3, with several items, for example
1  2  3
A  A  B
G  G  A
F  F  G
C  E
D  C
D
How can I efficiently derive the "master" list which contains all elements in the correct order B, A, G, F, E, C, D? I don't even know what keywords to search for. Any hints are much appreciated.
How about a graph-based approach.
Idea
The idea is to translate the sequences into paths in a directed graph (so A G F C D becomes a path A->G->F->C->D). By simplifying the graph we can then identify the longest connected sequence in that graph, which should then correspond to your "master" sequence.
Implementation
Note that I assume your sample data lst to be a list of vectors (see sample data at the end of this answer).
Let's construct an igraph from the different paths; each path is given by the entries in the lst vectors.
library(igraph)
ig <- make_empty_graph(
    n = length(unique(unlist(lst))),
    directed = TRUE) %>%
  set_vertex_attr("name", value = sort(unique(unlist(lst))))
for (i in 1:length(lst)) ig <- ig + path(lst[[i]])
Next we simplify the graph
ig <- simplify(ig)
It's instructive to plot the graph
plot(ig)
We now extract all simple paths; the longest simple path corresponds to the "master" list.
pths <- sapply(V(ig), function(x) {
  p <- all_simple_paths(ig, x)
  names(unlist(p[which.max(lengths(p))]))
})
pths[which.max(lengths(pths))]
$B
#[1] "B" "A" "G" "F" "E" "C" "D"
The sequence matches your expected output for the master list.
Sample data
v1 <- c("A","G","F","C","D","D")
v2 <- c("A","G","F","E","C")
v3 <- c("B", "A","G")
lst <- list(v1, v2, v3)
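Another way to look at the problem (my own addition, under the same lst assumption): each vector is a chain of precedence constraints, so the master list is a topological sort of the combined edge set, which igraph can compute directly.

# Hedged sketch: consecutive pairs become directed edges; topo_sort() orders them.
library(igraph)
edges <- do.call(rbind, lapply(lst, function(v) {
  v <- unique(v)                    # drop repeats such as the trailing "D" in v1
  cbind(head(v, -1), tail(v, -1))   # consecutive elements -> directed edges
}))
g2 <- graph_from_edgelist(edges, directed = TRUE)
as_ids(topo_sort(g2))
# [1] "B" "A" "G" "F" "E" "C" "D"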
Interesting problem. I think I have a working solution.
My thinking was that we can encode the vectors into a matrix to track which letters must come before and after each other letter by logic. Then we should be able to sort that matrix to find a working order.
Here, I take the three vectors and encode their implied ordering using nested loops.
v1 <- c("A","G","F","C","D","D")
v2 <- c("A","G","F","E","C")
v3 <- c("B", "A","G")
vecs <- list(v1, v2, v3)
unique_ltrs <- unique(unlist(vecs))
ltr_len <- length(unique_ltrs)
m <- matrix(0, nrow = ltr_len, ncol = ltr_len,
            dimnames = list(unique_ltrs, unique_ltrs))
# Loops to populate m with what we know
for (v in 1:length(vecs)) {
  vec <- unique(unlist(vecs[v]))
  for (l in 1:length(vec)) {
    for (l2 in 1:length(vec)) {
      m_pos <- c(match(vec[l], unique_ltrs),
                 match(vec[l2], unique_ltrs))
      compare <- ifelse(l < l2, -1, ifelse(l2 < l, 1, 0))
      m[m_pos[1], m_pos[2]] <- compare
    }
  }
}
Here, 1 indicates the column letter comes before the row letter, while -1 means the row comes first.
> m
   A  G  F  C  D  E  B
A  0 -1 -1 -1 -1 -1  1
G  1  0 -1 -1 -1 -1  1
F  1  1  0 -1 -1 -1  0
C  1  1  1  0 -1  1  0
D  1  1  1  1  0  0  0
E  1  1  1 -1  0  0  0
B -1 -1  0  0  0  0  0
Then we sort the matrix (relying on the code here), and a working order appears in the rownames:
m_ord <- m[do.call(order, as.data.frame(m)),]
#> m_ord
#   A  G  F  C  D  E  B
#B -1 -1  0  0  0  0  0
#A  0 -1 -1 -1 -1 -1  1
#G  1  0 -1 -1 -1 -1  1
#F  1  1  0 -1 -1 -1  0
#E  1  1  1 -1  0  0  0
#C  1  1  1  0 -1  1  0
#D  1  1  1  1  0  0  0
rownames(m_ord)
#[1] "B" "A" "G" "F" "E" "C" "D"
When downloading LexisNexis newspaper articles, there are often a lot of duplicate articles in the corpus. I want to remove them, and I was thinking of doing so by using cosine similarity statistics, but I'm not sure how to automate this. Any ideas?
Your question is fairly thin on details - such as a reproducible example - but it's an interesting question and challenge. So here goes.
Let's say we have a corpus consisting of two sets of similar documents, { (a1, a2, a3), (b1, b2) } where the letters indicate similarity. We want to keep just one document when the others are "duplicates", defined as similarity exceeding a threshold, say 0.80.
We can use textstat_simil() to generate a similarity matrix, and then form pairwise sets directly from the returned dist object, and then keep just one of the similar sets.
library("quanteda")
# Loading required package: quanteda
# Package version: 1.3.14
mydocs <- c(a1 = "a a a a a b b c d w g j t",
            b1 = "l y y h x x x x x y y y y",
            a2 = "a a a a a b c s k w i r f",
            b2 = "p q w e d x x x x y y y y",
            a3 = "a a a a a b b x k w i r f")
mydfm <- dfm(mydocs)
(sim <- textstat_simil(mydfm))
#             a1          b1          a2          b2
# b1 -0.22203788
# a2  0.80492203 -0.23090513
# b2 -0.23427416  0.90082239 -0.28140219
# a3  0.81167608 -0.09065452  0.92242890 -0.12530944
# create a data.frame of the unique pairs and their similarities
sim_pair_names <- t(combn(docnames(mydfm), 2))
sim_pairs <- data.frame(sim_pair_names,
                        sim = as.numeric(sim),
                        stringsAsFactors = FALSE)
sim_pairs
# X1 X2 sim
# 1 a1 b1 -0.22203788
# 2 a1 a2 0.80492203
# 3 a1 b2 -0.23427416
# 4 a1 a3 0.81167608
# 5 b1 a2 -0.23090513
# 6 b1 b2 0.90082239
# 7 b1 a3 -0.09065452
# 8 a2 b2 -0.28140219
# 9 a2 a3 0.92242890
# 10 b2 a3 -0.12530944
Subsetting this on our threshold condition, we can extract the names of the unlucky documents to be dropped, and feed this to a logical condition in dfm_subset().
# set the threshold for similarity
threshold <- 0.80
# discard one of the pair if similarity > threshold
todrop <- subset(sim_pairs, select = X1, subset = sim > threshold, drop = TRUE)
todrop
# [1] "a1" "a1" "b1" "a2"
# then subset the dfm, keeping only the "keepers"
dfm_subset(mydfm, !docnames(mydfm) %in% todrop)
# Document-feature matrix of: 2 documents, 20 features (62.5% sparse).
# 2 x 20 sparse Matrix of class "dfm"
# features
# docs a b c d w g j t l y h x s k i r f p q e
# b2 0 0 0 1 1 0 0 0 0 4 0 4 0 0 0 0 0 1 1 1
# a3 5 2 0 0 1 0 0 0 0 0 0 1 0 1 1 1 1 0 0 0
Other solutions to this problem of similar documents would be to form them into clusters, or to reduce the document matrix using principal components analysis, along the lines of latent semantic analysis.
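As a rough sketch of that clustering idea (my addition, not part of the answer above), you could convert the similarities into distances, cluster, and keep one document per cluster; this reuses sim, threshold, and mydfm from the code above.

# Hedged sketch: hierarchical clustering on 1 - similarity, cut at the
# complementary height of the threshold, keep one document per cluster.
distmat <- as.dist(1 - as.matrix(sim))       # sim comes from textstat_simil() above
hc <- hclust(distmat, method = "average")
groups <- cutree(hc, h = 1 - threshold)      # documents more similar than the threshold share a group
keep <- names(groups)[!duplicated(groups)]   # first document of each group survives
dfm_subset(mydfm, docnames(mydfm) %in% keep)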
You already received some excellent answers. But if you prefer a more automated approach targeted at your specific use case, you can use the package LexisNexisTools (which I wrote). It comes with a function called lnt_similarity(), which does exactly what you were looking for. I wrote a quick tutorial with mock data here.
The main difference between the solutions here and in lnt_similarity() is that I also take into account word order, which can make a big difference in some rare cases (see this blog post).
I also suggest you think carefully about thresholds, as you might otherwise remove some articles wrongfully. I included a function called lnt_diff() to visualize the difference between two articles, so you can get a better grip on the data you are removing.
If you have thousands of documents, it takes a lot of space in your RAM to save all the similarity scores, but you can set a minimum threshold in textstat_proxy(), the underlying function of textstat_simil().
In this example, cosine similarity scores smaller than 0.9 are all ignored.
library("quanteda")
mydocs <- c(a1 = "a a a a a b b c d w g j t",
            b1 = "l y y h x x x x x y y y y",
            a2 = "a a a a a b c s k w i r f",
            b2 = "p q w e d x x x x y y y y",
            a3 = "a a a a a b b x k w i r f")
mydfm <- dfm(mydocs)
(sim <- textstat_proxy(mydfm, method = "cosine", min_proxy = 0.9))
# 5 x 5 sparse Matrix of class "dsTMatrix"
# a1 b1 a2 b2 a3
# a1 1 . . . .
# b1 . 1.0000000 . 0.9113423 .
# a2 . . 1.0000000 . 0.9415838
# b2 . 0.9113423 . 1.0000000 .
# a3 . . 0.9415838 . 1.0000000
matrix2list <- function(x) {
    names(x@x) <- rownames(x)[x@i + 1]
    split(x@x, factor(x@j + 1, levels = seq(ncol(x)), labels = colnames(x)))
}
matrix2list(sim)
# $a1
# a1
# 1
#
# $b1
# b1
# 1
#
# $a2
# a2
# 1
#
# $b2
# b1 b2
# 0.9113423 1.0000000
#
# $a3
# a2 a3
# 0.9415838 1.0000000
See https://koheiw.net/?p=839 for the performance differences.
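As a possible follow-up (my own sketch, not from the answer above): once the thresholded sparse matrix is available, its upper triangle alone tells you which later document duplicates an earlier one, so the duplicates can be dropped in a couple of lines.

# Hedged sketch: keep the first document of each similar pair, drop the rest.
library("Matrix")                                 # for as.matrix() on the sparse "dsTMatrix"
simmat <- as.matrix(sim)
simmat[!upper.tri(simmat)] <- 0                   # consider each unordered pair once
todrop <- colnames(simmat)[colSums(simmat > 0.9) > 0]
todrop
# [1] "b2" "a3"
dfm_subset(mydfm, !docnames(mydfm) %in% todrop)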
I am currently having a problem using R to compare each column of a matrix with every other column. My approach has been to compare two whole columns at once, generate TRUE/FALSE output via the table command, convert the number of TRUEs to a numeric value, and place those counts in their respective positions in an incidence matrix.
For example, I have data in this type of format:
//Example state matrix - I am attempting to compare c1 with c2, then c1 with c3, then c1 with c4 and so on and so forth
   c1 c2 c3 c4
r1  2  6  3  2
r2  1  1  6  5
r3  3  1  3  6
And I am trying to instead put it into this format
//Example incidence matrix - Which is how many times c1 equaled c2 in the above matrix
   c1 c2 c3 c4
c1  3  1  1  1
c2  1  3  0  0
c3  1  0  3  0
c4  1  0  0  3
Here is the code I have come up with so far; however, I keep getting this warning --
Warning message:
In IncidenceMat[rat][r] = IncidenceMat[rat][r] + as.numeric(instances) :
  number of items to replace is not a multiple of replacement length
rawData = read.table("5-14-2014streamW636PPstate.txt")
colnames = names(rawData) # the column names in R
df <- data.frame(rawData)
rats = ncol(rawData)
instances = nrow(rawData)
IncidenceMat = matrix(rep(0, rats), nrow = rats, ncol = rats)
for(rat in rats)
  for(r in rats)
    if(rat == r){
      # rawData[instance][rat] == rawData[instance][r] -- something like this would work in C++ if I attempted it
      IncidenceMat[rat][r] = IncidenceMat[rat][r] + as.numeric(instances)
    } else{
      count = df[colnames[rat]] == df[colnames[r]]
      c = table(count)
      TotTrue = as.numeric(c[2][1])
      IncidenceMat[rat][r] = IncidenceMat[rat][r] + TotTrue # count would go here; this should work like a charm as well
    }
Any help would be greatly appreciated; I have also looked at some of these resources, however, I am still stumped
I tried this and this along with some other resources I recently closed.
How about this (note the incidence matrix is symmetric)?
df
   c1 c2 c3 c4
r1  2  6  3  2
r2  1  1  6  5
r3  3  1  3  6
incidence <- matrix(rep(0, ncol(df)*ncol(df)), nrow=ncol(df))
diag(incidence) <- nrow(df)
for (i in 1:(ncol(df)-1)) {
  for (j in (i+1):ncol(df)) {
    incidence[i,j] = incidence[j,i] = sum(df[,i] == df[,j])
  }
}
incidence
[,1] [,2] [,3] [,4]
[1,] 3 1 1 1
[2,] 1 3 0 0
[3,] 1 0 3 0
[4,] 1 0 0 3
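For comparison, here is a loop-free sketch of the same computation (my addition, not part of the answer above); nested sapply() calls count, for every pair of columns, how many rows agree, and the column names carry over automatically.

# Hedged alternative, assuming df is a data.frame as above.
incidence2 <- sapply(df, function(a) sapply(df, function(b) sum(a == b)))
incidence2
#    c1 c2 c3 c4
# c1  3  1  1  1
# c2  1  3  0  0
# c3  1  0  3  0
# c4  1  0  0  3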
Given the following data frame:
> header = c("A1","A2","A3","B1","B2","B3")
> df = matrix(c(0,0,0,0,0,0),nrow = 1)
> colnames(df) = header
> df
A1 A2 A3 B1 B2 B3
[1,] 0 0 0 0 0 0
I know the column index numbers of the headers containing "2" by:
> index2 = grep("2", colnames(df))
> index2
[1] 2 5
I want to add two extra columns named "A2.1","A2.2" and "B2.1", "B2.2" next to the columns with index 2 and 5, so that:
     A1 A2 A2.1 A2.2 A3 B1 B2 B2.1 B2.2 B3
[1,]  0  0    0    0  0  0  0    0    0  0
How can I do this?
Many thanks in advance!
Assuming that you want to insert columns based on 'index2', one option is
df1 <- cbind(df, do.call(cbind,
             replicate(2, df[, index2, drop = FALSE], simplify = FALSE)))
df2 <- df1[,order(colnames(df1)), drop=FALSE]
colnames(df2) <- make.unique(colnames(df2))
df2
# A1 A2 A2.1 A2.2 A3 B1 B2 B2.1 B2.2 B3
#[1,] 0 0 0 0 0 0 0 0 0 0
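If the alphabetical reordering trick ever breaks down (for example, when the new names would not sort next to their parent columns), a position-based sketch is possible; this is my own addition and reuses df and index2 from the question.

# Hedged sketch: insert two copies of each matched column right after it,
# working right to left so earlier indices stay valid.
newdf <- df
for (i in rev(index2)) {
  block <- df[, rep(i, 2), drop = FALSE]                    # two copies of the matched column
  colnames(block) <- paste0(colnames(df)[i], c(".1", ".2"))
  newdf <- cbind(newdf[, seq_len(i), drop = FALSE],
                 block,
                 newdf[, -seq_len(i), drop = FALSE])
}
newdf
#      A1 A2 A2.1 A2.2 A3 B1 B2 B2.1 B2.2 B3
# [1,]  0  0    0    0  0  0  0    0    0  0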
You could try something like this:
set.seed(1234)
df <- data.frame(matrix(runif(100),ncol=5))
colnames(df) <- LETTERS[1:ncol(df)]
B.1 <- runif(20)
df <- cbind(df,B.1)
df <- df[,order(colnames(df))]
#> head(df)
# A B B.1 C D E
#1 0.1137034 0.31661245 0.03545673 0.5533336 0.86483383 0.9264005
#2 0.6222994 0.30269337 0.56507611 0.6464061 0.04185728 0.4719097
#3 0.6092747 0.15904600 0.28025778 0.3118243 0.31718216 0.1426153
#4 0.6233794 0.03999592 0.20419632 0.6218192 0.01374994 0.5442698
#5 0.8609154 0.21879954 0.13373890 0.3297702 0.23902573 0.1961747
#6 0.6403106 0.81059855 0.32568192 0.5019975 0.70649462 0.8985805
The idea is to first attach the new column on the right with cbind() and then reorder the column sequence afterwards. Hope this helps.
I have two factor vectors, v1 and v2, which appear to be closely related (the entropy of each is very close to their joint entropy). Indeed, when I do table(v1,v2), I see something like this:
    v2
v1    a2  b2  c2
  a1   0 100   0
  b1   0   0   0
  c1   0   0   0
    v2
v1    d2  e2  f2
  a1   0   0   0
  b1   0   0   0
  c1   0   0   0
and so on - each factor has dozens of levels, so I get plenty of lines with all 0.
How do I print a table omitting rows which have only zeros in them?
Everybody seems to use rowSums(d) == 0 or equivalent, but that will also suppress any row whose positive and negative entries cancel out, or any other zero-sum combination. Safer would be to use:
d[ rowSums(d==0) != ncol(d) , ]
I suppose in the case where the object is the result of table(), there would not be the risk of negative entries, but the risk would occur when this strategy is inappropriately applied to other settings.
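A tiny illustration of that caveat (my addition): a row that sums to zero without being all zeros survives the d == 0 test but is lost by a plain row-sum test.

# Hedged example: "mixed" sums to zero but is not an all-zero row.
d <- rbind(mixed = c(-1, 1), zeros = c(0, 0), pos = c(2, 3))
d[rowSums(d) != 0, , drop = FALSE]             # drops "mixed" as well as "zeros"
d[rowSums(d == 0) != ncol(d), , drop = FALSE]  # drops only the all-zero row "zeros"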
Using your example:
v1 <- factor(rep("a1", 100), levels = paste0(letters[1:3], 1))
v2 <- factor(rep("b2", 100), levels = paste0(letters[1:6], 2))
R> (tab <- table(v1, v2))
    v2
v1    a2  b2  c2  d2  e2  f2
  a1   0 100   0   0   0   0
  b1   0   0   0   0   0   0
  c1   0   0   0   0   0   0
Then the rowSums() function will compute the row sums for us. This works because a table is either a vector or a matrix in disguise. Note in the sequence below, showing intermediate steps, how we convert the row sums into a logical vector by asking if they exceed 0.
R> rowSums(tab)
a1 b1 c1
100 0 0
R> rowSums(tab) > 0
a1 b1 c1
TRUE FALSE FALSE
R> tab[rowSums(tab) > 0, ]
a2 b2 c2 d2 e2 f2
0 100 0 0 0 0
The above drops the empty dimension. If you want to keep the table format, add drop = FALSE to the call; note the extra comma in there, as we want all columns, hence the empty argument between the two commas:
R> tab[rowSums(tab) > 0, , drop = FALSE]
v2
v1 a2 b2 c2 d2 e2 f2
a1 0 100 0 0 0 0
I'd approach this with rowSums() to get a logical vector of the rows greater than 0, and then use that vector for indexing, as in:
#make an example (please do this for yourself in the future)
d <- table(x=1:5, y=1:5)
d[1, 1] <- 0 #make one row have all 0s
d[rowSums(d) > 0, ]
Borrowing example data from #Gavin's answer
v1 <- factor(rep("a1", 100), levels = paste0(letters[1:3], 1))
v2 <- factor(rep("b2", 100), levels = paste0(letters[1:6], 2))
You can use droplevels to eliminate those values that do not appear anywhere (equivalent to rows with all 0's, or columns with all 0's)
> table(droplevels(v1), droplevels(v2))
b2
a1 100
If you only want to drop rows:
> table(droplevels(v1), v2)
v2
a2 b2 c2 d2 e2 f2
a1 0 100 0 0 0 0