Reducing the number of nodes of a tree, to obtain nodes with more than one child node - r

The following tree:
has been obtained from the following matrix
> mat
7 23 47 41 31
7 23 53 41 31
7 23 53 41 37
7 29 47 41 31
7 29 47 41 37
7 29 53 41 31
7 29 53 41 37
11 29 53 41 31
11 29 53 41 37
taking each columns of 'mat' as a level of the tree. If 'data' is the dataframe where the matrix 'mat' is stored
V1 V2 V3 V4 V5
7 23 47 41 31
7 23 53 41 31
7 23 53 41 37
7 29 47 41 31
7 29 47 41 37
7 29 53 41 31
7 29 53 41 37
11 29 53 41 31
11 29 53 41 37
the code that produces above tree is the following
> data$pathString<-paste("0", data$V1,data$V2,data$V3,data$V4,data$V5,sep = "/")
> p_tree <- as.Node(data)
> export_graph(ToDiagrammeRGraph(p_tree), "tree.png")
I would like to modify the tree as follows: (1) if a node at level 'n', labelled by number x, has only one child node at level 'n+1', labelled by number y, then the program brings together these two nodes in one node labelled by the result of the product x*y; 2) if the node at level 'n+1' does not have child nodes, the program does nothing and starts again from another branch; 3) if the node at level 'n+1' has more than one child node, the program apply point (1) and starts again from each of child nodes.
For example, for the tree of our example, the code should:
replace the nodes circled in red with a node labelled by 31*41*47=59737
replace the nodes circled in orange with a node labelled by 53*41=2173
replace the nodes circled in green with a node labelled by 47*41=1927
replace the nodes circled in blue with a node labelled by 11*29*53*41=693187

Try this:
freq <- sapply(1:ncol(data), function(x) {
df <- data[, 1:x, drop = FALSE]
cc <- aggregate(df[, 1], as.list(df), FUN = length)
merge(df, cc, by = colnames(df), sort = FALSE)[, "x"]
})
data$pathString <- sapply(1:nrow(data), function(x) {
g <- 1
for(i in 2:ncol(freq)) g <- c(g,
if(freq[x, i] == freq[x, i - 1]) g[i - 1] else g[i - 1] + 1)
paste0(c("0", tapply(unlist(data[x, , drop = TRUE]), g, prod)), collapse = "/")
})
p_tree <- as.Node(data)
plot(p_tree)

Related

How to use characters in variables summing in R?

I have some dataframe. Here is a small expample:
a <- rnorm(100, 5, 2)
b <- rnorm(100, 10, 3)
c <- rnorm(100, 15, 4)
df <- data.frame(a, b, c)
And I have a character variable vect <- "c('a','b')"
When I try to calculate sum of vars using command
df$d <- df[vect]
which must be an equivalent of
df$d <- df[c('a','b')]
But, as a reslut I have got an error
[.data.frame(df, vect) :undefined columns selected
You're assumption that
vect <- "c('a','b')"
df$d <- df[vect]
is equivalent to
df$d <- df[c('a','b')]
is incorrect.
As #Karthik points out, you should remove the quotation marks in the assignment to vect
However, from your question it sounds like you want to then sum the elements specified in vect and then assign to d. To do this you need to slightly change your code
vect <- c('a','b')
df$d <- apply(X = df[vect], MARGIN = 1, FUN = sum)
This does elementwise sum on the columns in df specified by vect. The MARGIN = 1 specifies that we want to apply the sum rowise rather than columnwise.
EDIT:
As #ThomasIsCoding points out below, if for some reason vect has to be a string, you can parse a string to an R expression using str2lang
vect <- "c('a','b')"
parsed_vect <- eval(str2lang(vect))
df$d <- apply(X = df[parsed_vect], MARGIN = 1, FUN = sum)
Perhaps you can try
> df[eval(str2lang(vect))]
a b
1 8.1588519 9.0617818
2 3.9361214 13.2752377
3 5.5370983 8.8739725
4 8.4542050 8.5704234
5 3.9044461 13.2642793
6 5.6679639 12.9529061
7 4.0183808 6.4746806
8 3.6415608 11.0308990
9 4.5237453 7.3255129
10 6.9379168 9.4594150
11 5.1557935 11.6776181
12 2.3829337 3.5170335
13 4.3556430 7.9706624
14 7.3274615 8.1852829
15 -0.5650641 2.8109197
16 7.1742283 6.8161200
17 3.3412044 11.6298940
18 2.5388981 10.1289533
19 3.8845686 14.1517643
20 2.4431608 6.8374837
21 4.8731053 12.7258259
22 6.9534912 6.5069513
23 4.4394807 14.5320225
24 2.0427553 12.1786148
25 7.1563978 11.9671603
26 2.4231207 6.1801862
27 6.5830372 0.9814878
28 2.5443326 9.8774632
29 1.1260322 9.4804636
30 4.0078436 12.9909014
31 9.3599808 12.2178596
32 3.5362245 8.6758910
33 4.6462337 8.6647953
34 2.0698037 7.2750532
35 7.0727970 8.9386798
36 4.8465248 8.0565347
37 5.6084462 7.5676308
38 6.7617479 9.5357666
39 5.2138482 13.6822924
40 3.6259103 13.8659939
41 5.8586547 6.5087016
42 4.3490281 9.5367522
43 7.5130701 8.1699117
44 3.7933813 9.3241308
45 4.9466813 9.4432584
46 -0.3730035 6.4695187
47 2.0646458 10.6511916
48 4.6027309 4.9207746
49 5.9919348 7.1946723
50 6.0148330 13.4702419
51 5.5354452 9.0193366
52 5.2621651 12.8856488
53 6.8580210 6.3526151
54 8.0812166 14.4659778
55 3.6039030 5.9857886
56 9.8548553 15.9081336
57 3.3675037 14.7207681
58 3.9935336 14.3186175
59 3.4308085 10.6024579
60 3.9609624 6.6595521
61 4.2358603 10.6600581
62 5.1791856 9.3241118
63 4.6976289 13.2833055
64 5.1868906 7.1323826
65 3.1810915 12.8402472
66 6.0258287 9.3805249
67 5.3768112 6.3805096
68 5.7072092 7.1130150
69 6.5789349 8.0092541
70 5.3175820 17.3377234
71 9.7706112 10.8648956
72 5.2332127 12.3418373
73 4.7626124 13.8816910
74 3.9395911 6.5270785
75 6.4394724 10.6344965
76 2.6803695 10.4501753
77 3.5577834 8.2323369
78 5.8431140 7.7932460
79 2.8596818 8.9581837
80 2.7365174 10.2902512
81 4.7560973 6.4555758
82 4.6519084 8.9786777
83 4.9467471 11.2818536
84 5.6167284 5.2641380
85 9.4700525 2.9904731
86 4.7392906 11.3572521
87 3.1221908 6.3881556
88 5.6949432 7.4518023
89 5.1435241 10.8912283
90 2.1628966 10.5080671
91 3.6380837 15.0594135
92 5.3434709 7.4034042
93 -0.1298439 0.4832707
94 7.8759390 2.7411723
95 2.0898649 9.7687250
96 4.2131549 9.3175228
97 5.0648105 11.3943350
98 7.7225193 11.4180456
99 3.1018895 12.8890257
100 4.4166832 10.4901303

GAM Predictions in R have same curve shape

I have a dataframe:
Albedo Year_Since_Burn Summer_SRAD Winter_SRAD
1 397.00 1 17801.70 6589.56
2 289.60 2 18027.20 6633.96
3 615.29 3 17397.10 6952.69
4 258.12 4 17793.63 6627.62
5 139.32 5 17853.00 6675.00
6 463.81 6 17853.00 6675.00
7 532.47 7 17853.00 6675.00
8 300.09 8 17648.00 6890.00
9 118.00 9 17786.13 6724.67
10 238.18 10 18050.13 6916.46
11 439.11 11 18057.20 6893.08
12 366.00 12 17823.00 6618.12
13 441.25 13 17809.50 6673.79
14 450.31 14 17654.40 6849.19
15 275.43 15 17592.80 7202.88
16 147.11 16 17830.20 6672.88
17 285.68 17 18065.13 6897.58
18 309.61 18 17665.80 7036.62
19 264.95 19 18053.47 6867.17
20 125.18 20 17834.40 6661.19
21 289.50 21 17824.00 6684.50
22 293.61 22 17826.90 6681.83
23 368.95 23 17634.55 6914.06
24 563.11 24 17434.23 7043.04
25 434.41 25 17527.60 7070.38
26 199.78 26 17955.40 6704.00
27 153.37 27 17872.70 6637.00
28 287.29 28 17843.20 6659.67
29 173.52 29 17822.93 6616.75
30 239.28 30 17884.00 6580.56
31 292.91 31 17884.00 6580.56
32 323.00 32 18078.70 6758.50
33 282.00 33 18078.70 6758.50
34 237.50 34 17779.10 7303.38
35 225.00 35 17822.80 6617.42
36 237.55 36 17822.80 6617.42
37 247.11 37 17918.50 6695.71
38 336.48 38 17918.50 6695.71
39 290.00 39 17918.50 6695.71
40 248.42 40 17822.80 6617.42
41 304.74 41 17918.50 6695.71
42 311.52 42 17918.50 6695.71
43 281.39 43 17918.50 6695.71
44 234.68 44 17918.50 6695.71
45 297.58 45 17918.50 6695.71
46 265.52 46 17918.50 6695.71
47 186.29 47 17918.50 6695.71
48 291.16 48 17918.50 6695.71
49 185.17 49 17918.50 6695.71
50 288.94 50 17918.50 6695.71
51 269.64 51 17918.50 6695.71
52 255.00 52 17918.50 6695.71
I am fitting a GAM model in R like so:
gam.m1 <- gam(Albedo ~ s(Year_Since_Burn) + s(Summer_SRAD) + s(Winter_SRAD), data=df)
which seems to work fine, and returns result as I would expect.
I have now created some data to predict on. Essentially I randomly selected 2 rows in the original df, duplicated them 52 times each, and then removed the Year_Since_Burn and Albedo columns, and created some new Year_Since_Burn data. So I only manipulated one independent variable. I did this like so:
df <- df[sample(nrow(df), 2),]
df <- df %>% select (-c(Albedo, Year_Since_Burn))
#add an id columns
df$ID <- seq.int(nrow(df))
#loop through each row
for (i in 1:nrow(df)) {
#select each row
grp <- (df[i, ])
#repeat each row 52 times
grp <- grp[rep(seq_len(nrow(grp)), each=52),]
#add a column for year since burn
grp$Year_Since_Burn <- seq.int(nrow(grp))
#select rows to keep
grp <- grp %>% select (c(ID, Year_Since_Burn,Summer_SRAD, Winter_SRAD, Winter_Tavg, Summer_Tavg, PFI, Bulk_Density,
SOC_Content, SOC_Stock, L3_Ecoregion))
#append
combined[[i]] <- grp
}
#concat
final = do.call(rbind, combined)
Now for each unique ID in final I predicted the dependent variable like so:
y_hat <- predict(gam.m1, final)
and then I plotted to look at how the predictions varied with Year_Since_Burn:
final2 <- data.frame(as.array(final$ID), as.array(final$Year_Since_Burn), y_hat2)
names(final2) <- c("ID", 'Year_Since_Burn', "Predicted")
#plot
plots1 <- lapply(split(final2, final2$ID),
function(x)
ggplot(x, aes(x=Year_Since_Burn, y=Predicted)) +
geom_line())
For the output graphs the curves are identical in shape for every prediction, it is just the magnitudes that are shifting. I am not sure if this is what GAMs is supposed to do or if this is an error on my part. This is what the predictions for two different Id's look like:

R: many nested loops to remove rows in multiple data frames

I have 18 data frames called regular55, regular56, regular57, collar55, collar56, etc. In each data frame, I want to delete the first row of each nest.
Each data frame looks like this:
nest interval
1 17 -8005
2 17 183
3 17 186
4 17 221
5 17 141
6 17 30
7 17 158
8 17 23
9 17 199
10 17 51
11 17 169
12 17 176
13 31 905
14 31 478
15 31 40
16 31 488
17 31 16
18 31 203
19 31 54
20 31 341
21 31 54
22 50 -14164
23 50 98
24 50 1438
25 71 240
26 71 725
27 71 819
28 85 -13935
29 85 45
30 85 589
31 85 47
32 85 161
33 85 67
The solution I came up with to avoid writing out the function for each one of the 18 data frames includes many nested loops:
for (i in 5:7){
for (j in 5:7) {
for (k in c("regular","collar")){
for (l in c(unique(paste0(k,i,j,"$nest")))){
paste0(k,i,j)=paste0(k,i,j)[(-c(which((paste0(k,i,j,"$nest")) == l )
[1])),]
}}}}
I'm basically selecting the first value at "which" there is a "unique" value of nest. However, I get:
Error in paste0(k, i, j)[(-c(which((paste0(k, i, j, "$nest")) == l)[1])), :
incorrect number of dimensions
It might be because "paste0(k,i,j)" is only considered as a character and not recognized as the name for a data frame.
Any ideas on how to fix this? Or any other ways to delete the first rows for each nest in every data frame?
Thanks to help from the comments, my problem was solved.
Originally, I divided my data frame using a for loop and then grouped it into one list:
for (i in 5:7) {
for (j in 5:7) {
for (k in c("regular","collar")){
assign(paste0(k,i,j),
df[df$x == i & df$y == j & df$z == k,])
}}}
df.list=mget(ls(pattern=("[regular,collar][5-7][5-7]")))
I later found a way to split my data frame directly into a list based on multiple columns (R subsetting a data frame into multiple data frames based on multiple column values):
df.list= split(df, with(df, interaction(df$x, df$y, df$z)), drop = TRUE)
Finally, I was able to apply the function to remove the first rows of each nest:
df.list.updated = lapply(df.list, function(d) d %>% group_by(nest) %>%
slice(2:n()))
It is definitely easier to work from a list of data frames.

Performence for calculating the distance between two positions on a tree?

Here is a tree. The first column is an identifier for the branch, where 0 is the trunk, L is the first branch on the left and R is the first branch on the right. LL is the branch on the extreme left after the second bifurcation, etc.. the variable length contains the length of each branch.
> tree
branch length
1 0 20
2 L 12
3 LL 19
4 R 19
5 RL 12
6 RLL 10
7 RLR 12
8 RR 17
tree = data.frame(branch = c("0","L", "LL", "R", "RL", "RLL", "RLR", "RR"), length=c(20,12,19,19,12,10,12,17))
tree$branch = as.character(tree$branch)
and here is a drawing of this tree
Here are two positions on this tree
posA = tree[4,]; posA$length = 12
posB = tree[6,]; posB$length = 3
The positions are given by the branch ID and the distance (variable length) to the origin of the branch (more info in edits).
I wrote the following messy distance function to calculate the shortest distance along the branches between any two points on the tree. The shortest distance along the branches can be understood as the minimal distance an ant would need to walk along the branches to reach one position from the other position.
distance = function(tree, pos1, pos2){
if (identical(pos1$branch, pos2$branch)){Dist=pos1$length-pos2$length;return(Dist)}
pos1path = strsplit(pos1$branch, "")[[1]]
if (pos1path[1]!="0") {pos1path = c("0", pos1path)}
pos2path = strsplit(pos2$branch, "")[[1]]
if (pos2path[1]!="0") {pos2path = c("0", pos2path)}
loop = 1:min(length(pos1path), length(pos2path))
loop = loop[-which(loop == 1)]
CommonTrace="included"; for (i in loop) {
if (pos1path[i] != pos2path[i]) {
CommonTrace = i-1; break
}
}
if(CommonTrace=="included"){
CommonTrace = min(length(pos1path), length(pos2path))
if (length(pos1path) > length(pos2path)) {
longerpos = pos1; shorterpos = pos2; longerpospath = pos1path
} else {
longerpos = pos2; shorterpos = pos1; longerpospath = pos2path
}
distToNode = 0
if ((CommonTrace+1) != length(longerpospath)){
for (i in (CommonTrace+1):(length(longerpospath)-1)){
distToNode = distToNode + tree$length[tree$branch == paste0(longerpospath[2:i], collapse='')]
}
}
Dist = distToNode + longerpos$length + (tree[tree$branch == shorterpos$branch,]$length-shorterpos$length)
if (identical(shorterpos, pos1)){Dist=-Dist}
return(Dist)
} elseĀ { # if they are sisterbranch
Dist=0
if((CommonTrace+1) != length(pos1path)){
for (i in (CommonTrace+1):(length(pos1path)-1)){
Dist = Dist + tree$length[tree$branch == paste0(pos1path[2:i], collapse='')]
}
}
if((CommonTrace+1) != length(pos2path)){
for (i in (CommonTrace+1):(length(pos2path)-1)){
Dist = Dist + tree$length[tree$branch == paste(pos2path[2:i], collapse='')]
}
}
Dist = Dist + pos1$length + pos2$length
return(Dist)
}
}
I think the algorithm works fine but it is not very efficient. Note the sign of the distance that is important. This sign only makes sense when the two positions are not found on "sister branches". That is the sign makes sense only if one of the two positions is found in the way between the roots and the other position.
distance(tree, posA, posB) # -22
I then just loop through all positions of interest like that:
allpositions=rbind(tree, tree)
allpositions$length = c(1,5,8,2,2,3,5,6,7,8,2,3,1,2,5,6)
mat = matrix(-1, ncol=nrow(allpositions), nrow=nrow(allpositions))
for (i in 1:nrow(allpositions)){
for (j in 1:nrow(allpositions)){
posA = allpositions[i,]
posB = allpositions[j,]
mat[i,j] = distance(tree, posA, posB)
}
}
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
# 1 0 -24 -39 -21 -40 -53 -55 -44 -6 -27 -33 -22 -39 -52 -55 -44
# 2 24 0 -15 7 26 39 41 30 18 -3 -9 8 25 38 41 30
# 3 39 15 0 22 41 54 56 45 33 12 6 23 40 53 56 45
# 4 21 7 22 0 -19 -32 -34 -23 15 10 16 -1 -18 -31 -34 -23
# 5 40 26 41 19 0 -13 -15 8 34 29 35 18 1 -12 -15 8
# 6 53 39 54 32 13 0 8 21 47 42 48 31 14 1 8 21
# 7 55 41 56 34 15 8 0 23 49 44 50 33 16 7 0 23
# 8 44 30 45 23 8 21 23 0 38 33 39 22 7 20 23 0
# 9 6 -18 -33 -15 -34 -47 -49 -38 0 -21 -27 -16 -33 -46 -49 -38
# 10 27 3 -12 10 29 42 44 33 21 0 -6 11 28 41 44 33
# 11 33 9 -6 16 35 48 50 39 27 6 0 17 34 47 50 39
# 12 22 8 23 1 -18 -31 -33 -22 16 11 17 0 -17 -30 -33 -22
# 13 39 25 40 18 -1 -14 -16 7 33 28 34 17 0 -13 -16 7
# 14 52 38 53 31 12 -1 7 20 46 41 47 30 13 0 7 20
# 15 55 41 56 34 15 8 0 23 49 44 50 33 16 7 0 23
# 16 44 30 45 23 8 21 23 0 38 33 39 22 7 20 23 0
As an example, let's consider the first and the third positions in the object allpositions. The distance between them is 39 (and -39) because an ant would need to walk 19 units on branch 0 and then walk 12 units on branch L and finally the ant would need to walk 8 units on branch LL. 19 + 12 + 8 = 39
The issue is that I have about 20 very big trees with about 50000 positions and I would like to calculate the distance between any two positions. There are therefore 20 * 50000^2 distances to compute. It takes forever! Can you help me to improve my code?
EDIT
Please let me know if anything is still unclear
tree is a description of a tree. The tree has branches of a certain length. The name of the branches (variable: branch) gives indication about the relationship between the branches. The branch RL is a "parent branch" of the two branches RLL and RLR, where R and L stand for right and left.
allpositions is an data.frame, where each line represents one independent position on the tree. You can think of the position of a squirrel. The position is defined by two information. 1) The branch (variable: branch) on which the squirrel is standing and the the distance between the beginning of the branch and the position of the squirrel (variable: length).
Three examples
Consider a first squirrel that is at position (variable: length) 8 on the branch RL (which length is 12) and a second squirrel that is at position (variable: length) 2 on the branch RLL or RLR. The distance between the two squirrels is 12 - 8 + 2 = 6 (or -6).
Consider a first squirrel that is at position (variable: length) 8 on the branch RL and a second squirrel that is at position (variable: length) 2 on the branch RR. The distance between the two squirrels is 8 + 2 = 10 (or -10).
Consider a first squirrel that is at position (variable: length) 8 on the branch R (which length is 19) and a second squirrel that is at position (variable: length) 2 on the branch RLL. Knowing the that branch RL has a length of 12, the distance between the two squirrels is 19 - 8 + 12 + 2 = 25 (or -25).
The code below uses the igraph package to compute the distances between positions in tree and seems noticeably faster than the code you posted in your question. The approach is to create graph vertices at branch intersections and at positions along tree branches at the positions specified in allpositions. Graph edges are the branch segments between these vertices. It uses igraph to build a graph for the tree and allpositions and then finds the distances between the vertices corresponding to allposition data.
t.graph <- function(tree, positions) {
library(igraph)
# Assign vertex name to tree branch intersections
n_label <- nchar(tree$branch)
tree$high_vert <- tree$branch
tree$low_vert <- tree$branch
tree$brnch_type <- "tree"
for( i in 1:nrow(tree) ) {
tree$low_vert[i] <- if(n_label[i] > 1) substr(tree$branch[i], 1, n_label[i]-1)
else { if(tree$branch[i] %in% c("R","L")) "0"
else "root" }
}
# combine position data with tree data
positions$brnch_type <- "position"
temp <- merge(positions, tree, by = "branch")
positions <- temp[, c("branch","length.x","high_vert","low_vert","brnch_type.x")]
positions$high_vert <- paste(positions$branch, positions$length.x, sep="_")
colnames(positions) <- c("branch","length","high_vert","low_vert","brnch_type")
tree <- rbind(tree, positions)
# use positions to segment tree branches
tree_brnch <- split(tree, tree$branch)
tree <- data.frame( branch=NA_character_, length = NA_real_, high_vert = NA_character_,
low_vert = NA_character_, brnch_type =NA_character_, seg_len= NA_real_)
for( ib in 1: length(tree_brnch)) {
brnch_seg <- tree_brnch[[ib]][order(tree_brnch[[ib]]$length, decreasing=TRUE), ]
n_seg <- nrow(brnch_seg)
brnch_seg$seg_len <- brnch_seg$length
for( is in 1:(n_seg-1) ) {
brnch_seg$seg_len[is] <- brnch_seg$length[is] - brnch_seg$length[is+1]
brnch_seg$low_vert[is] <- brnch_seg$high_vert[is+1]
}
tree <- rbind(tree, brnch_seg)
}
tree <- tree[-1,]
# Create graph of tree and positions
tree_graph <- graph.data.frame(tree[,c("low_vert","high_vert")])
E(tree_graph)$label <- tree$high_vert
E(tree_graph)$brnch_type <- tree$brnch_type
E(tree_graph)$weight <- tree$seg_len
# calculate shortest distances between position vertices
position_verts <- V(tree_graph)[grep("_", V(tree_graph)$name)]
vert_dist <- shortest.paths(tree_graph, v=position_verts, to=position_verts, mode="all")
return(dist_mat= vert_dist )
}
I've benchmarked igraph code ( the t.graph function) against the code posted in your question by making a function named Remi for your code over allposition data using your distance function. Sample trees were created as extensions of your tree and allpositions data for trees of 64, 256, and 2048 branches and allpositions equal to twice these sizes. Comparisons of execution times are shown below. Notice that times are in milliseconds.
microbenchmark(matR16 <- Remi(tree, allpositions), matG16 <- t.graph(tree, allpositions),
matR256 <- Remi(tree256, allpositions256), matG256 <- t.graph(tree256, allpositions256), times=2)
Unit: milliseconds
expr min lq mean median uq max neval
matR8 <- Remi(tree, allpositions) 58.82173 58.82173 59.92444 59.92444 61.02714 61.02714 2
matG8 <- t.graph(tree, allpositions) 11.82064 11.82064 13.15275 13.15275 14.48486 14.48486 2
matR256 <- Remi(tree256, allpositions256) 114795.50865 114795.50865 114838.99490 114838.99490 114882.48114 114882.48114 2
matG256 <- t.graph(tree256, allpositions256) 379.54559 379.54559 379.76673 379.76673 379.98787 379.98787 2
Compared to the code you posted, the igraph results are only about 5 times faster for the 8 branch case but are over 300 times faster for 256 branches so igraph seems to scale better with size. I've also benchmarked the igraph code for the 2048 branch case with the following results. Again times are in milliseconds.
microbenchmark(matG8 <- t.graph(tree, allpositions), matG64 <- t.graph(tree64, allpositions64),
matG256 <- t.graph(tree256, allpositions256), matG2k <- t.graph(tree2k, allpositions2k), times=2)
Unit: milliseconds
expr min lq mean median uq max neval
matG8 <- t.graph(tree, allpositions) 11.78072 11.78072 12.00599 12.00599 12.23126 12.23126 2
matG64 <- t.graph(tree64, allpositions64) 73.29006 73.29006 73.49409 73.49409 73.69812 73.69812 2
matG256 <- t.graph(tree256, allpositions256) 377.21756 377.21756 410.01268 410.01268 442.80780 442.80780 2
matG2k <- t.graph(tree2k, allpositions2k) 11311.05758 11311.05758 11362.93701 11362.93701 11414.81645 11414.81645 2
so the distance matrix for about 4000 positions is calculated in less than 12 seconds.
t.graph returns the distance matrix where the rows and columns of the matrix are labeled by branch names - position on the branch so for example
0_7 0_1 L_8 L_5 LL_8 LL_2 R_3 R_2 RL_2 RL_1 RLL_3 RLL_2 RLR_5 RR_6
L_5 18 24 3 0 15 9 8 7 26 25 39 38 41 30
shows the distances from L-5, the position 5 units along the L branch, to the other positions.
I don't know that this will handle your largest cases, but it may be helpful for some. You also have problems with the storage requirements for your largest cases.

R error type "Subscript out of bounds"

I am simulating a correlation matrix, where the 60 variables correlate in the following way:
more highly (0.6) for every two variables (1-2, 3-4... 59-60)
moderate (0.3) for every group of 12 variables (1-12,13-24...)
mc <- matrix(0,60,60)
diag(mc) <- 1
for (c in seq(1,59,2)){ # every pair of variables in order are given 0.6 correlation
mc[c,c+1] <- 0.6
mc[c+1,c] <- 0.6
}
for (n in seq(1,51,10)){ # every group of 12 are given correlation of 0.3
for (w in seq(12,60,12)){ # these are variables 11-12, 21-22 and such.
mc[n:n+1,c(n+2,w)] <- 0.2
mc[c(n+2,w),n:n+1] <- 0.2
}
}
for (m in seq(3,9,2)){ # every group of 12 are given correlation of 0.3
for (w in seq(12,60,12)){ # these variables are the rest.
mc[m:m+1,c(1:m-1,m+2:w)] <- 0.2
mc[c(1:m-1,m+2:w),m:m+1] <- 0.2
}
}
The first loop works well, but not the second and third ones. I get this error message:
Error in `[<-`(`*tmp*`, m:m + 1, c(1:m - 1, m + 2:w), value = 0.2) :
subscript out of bounds
Error in `[<-`(`*tmp*`, m:m + 1, c(1:m - 1, m + 2:w), value = 0.2) :
subscript out of bounds
I would really appreciate any hints, since I don't see the loop commands get to exceed the matrix dimensions. Thanks a lot in advance!
Note that : takes precedence over +. E.g., n:n+1 is the same as n+1. I guess you want n:(n+1).
The maximal value of w is 60:
w <- 60
m <- 1
m+2:w
#[1] 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
#[49] 51 52 53 54 55 56 57 58 59 60 61
And 61 is out of bounds. You need to add a lot of parentheses.

Resources