How to index coordinate data according to a grid in R?

I have coordinate data (x and y coordinates) in the following ranges:
XPos: 27-1367 nm, YPos: 67-1014 nm. A data set consists of about 2500-3500 data points.
Here are the first rows of such a data set:
XPos YPos
1 29 211
2 31 609
3 33 1001
4 35 508
5 37 424
6 39 584
7 40 378
8 41 204
9 41 444
10 41 872
...
[![Data plotted][1]][1]
Now I would like to index the data points by laying a grid of equally sized cells (quadrants) over the data in R. The result should be a new column "grid_index" containing the unique ID of the quadrant in which each data point is located (see image). Is there an easy way to do this? I would like to try different grid unit sizes to partition the data, e.g. squares of 50 nm, 100 nm, 200 nm or 400 nm and rectangles of 100 nm x 200 nm or 50 nm x 100 nm.
[![Grid for data point indexing][2]][2]
[![Each grid quadrant should have a unique ID][3]][3]
I would be very grateful for any help.

Here's an approach with findInterval:
First set up a matrix that has the appropriate number of indices:
pos.matrix <- matrix(1:35,byrow = TRUE, nrow = 5)
pos.matrix
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 1 2 3 4 5 6 7
[2,] 8 9 10 11 12 13 14
[3,] 15 16 17 18 19 20 21
[4,] 22 23 24 25 26 27 28
[5,] 29 30 31 32 33 34 35
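Next, use findInterval to find the column and row of the matrix in which each point lies. You can control the size of the grid with the by = argument of seq. Note that the dimensions of the matrix must match the number of intervals passed to findInterval (here 7 intervals for x and 5 for y). The y index is subtracted from 6 and wrapped in abs so that row 1 of the matrix corresponds to the top of the graph: matrix rows are numbered downwards, while y values increase upwards. The data object is defined under Sample Data below.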
grid <- apply(cbind(findInterval(data[,"XPos"],seq(0,1400,by = 200)),
abs(findInterval(data[,"YPos"],seq(0,1000,by = 200)) - 6)),
MARGIN = 1,
function(x) pos.matrix[x[2],x[1]])
grid[1:25]
[1] 30 34 31 17 19 26 15 31 19 5 18 32 25 25 14 20 22 19 35 2 16 8 29 29 16
plot(NA,xlim = c(0,1400), ylim = c(0,1000), xlab = "XPos", ylab = "YPos", cex.axis = 0.8)
text(data[,1],data[,2], labels = grid, cex = 0.4)
Sample Data
set.seed(3)
data <- data.frame(XPos = runif(1000,0,1400), YPos = runif(1000,0,1000))
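To try other cell sizes without rebuilding pos.matrix by hand, the same idea can be wrapped in a small function. This is only a sketch under stated assumptions: grid_index and its arguments cell_w, cell_h, xlim and ylim are names made up for illustration, it uses floor() arithmetic rather than findInterval, it assumes every point lies within xlim and ylim, and it numbers cells row-wise starting from the top-left cell, as pos.matrix does.

```r
# Hypothetical helper (not from the answer above): returns one cell ID per point,
# numbered row-wise from the top-left cell of the grid.
grid_index <- function(x, y, cell_w, cell_h = cell_w,
                       xlim = c(0, 1400), ylim = c(0, 1000)) {
  # Number of columns and rows needed to cover the plotting area
  n_col <- ceiling((xlim[2] - xlim[1]) / cell_w)
  n_row <- ceiling((ylim[2] - ylim[1]) / cell_h)
  # Column index from the left; pmin keeps points on the right edge in the last column
  col <- pmin(floor((x - xlim[1]) / cell_w) + 1, n_col)
  # Row index counted from the bottom, then flipped so that row 1 is the top row
  row_from_bottom <- pmin(floor((y - ylim[1]) / cell_h) + 1, n_row)
  row <- n_row + 1 - row_from_bottom
  (row - 1) * n_col + col
}

set.seed(3)
data <- data.frame(XPos = runif(1000, 0, 1400), YPos = runif(1000, 0, 1000))
data$grid_index <- grid_index(data$XPos, data$YPos, 200)            # 200 x 200 cells
data$grid_index_rect <- grid_index(data$XPos, data$YPos, 100, 200)  # 100 x 200 cells
```

With real data, xlim and ylim would have to be set so that they cover the observed coordinate ranges.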

Related

test for in-tile for Dirichlet tile, using R

So I can take points and use the R libraries deldir or spatstat::dirichlet to find the Dirichlet tessellation of those points.
Now I have a point not in the set, and I want to know the indices of the points forming the dirichlet tile which my not-in-set-point is interior to. I can get there by knowing the tile label (or index).
Are there any libraries or methods to do this? I'm thinking spatstat, but not finding something there yet.

The function cut.ppp() can take a point pattern and find which tessellation
tile each point in the pattern belongs to. Below is the code for a simple
example of a point pattern that contains only a single point (0.5, 0.5).
library(spatstat)
dd <- dirichlet(cells)
plot.tess(dd, do.labels = TRUE)
xx <- ppp(.5, .5, window = Window(dd))
plot(xx, add = TRUE, col = "red", cex = 2, pch = 20)
yy <- cut(xx, dd)
yy
#> Marked planar point pattern: 1 point
#> Multitype, with levels =
#> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
#> 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
#> window: rectangle = [0, 1] x [0, 1] units
marks(yy)
#> [1] 18
#> 42 Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 ... 42
Created on 2018-12-03 by the reprex package (v0.2.1)

If X is a point pattern and B is a tessellation, then
M <- marks(cut(X, B))
returns a factor (vector of categorical values) identifying which tile contains each of the points of X. Alternatively,
M <- tileindex(X$x, X$y, B)
or
f <- as.function(B)
M <- f(X)
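Independently of spatstat, the tile lookup can be cross-checked in base R, because a point lies in the Dirichlet (Voronoi) tile of whichever generating point is closest to it. The following is a minimal sketch: nearest_seed and the toy coordinates are made up for illustration, and it assumes that the tiles are labelled in the order of the generating points.

```r
# Hypothetical helper: index of the generating point nearest to each query point.
# By definition of the Dirichlet tessellation this is the index of the containing tile.
nearest_seed <- function(seed_x, seed_y, query_x, query_y) {
  vapply(seq_along(query_x), function(i) {
    # Squared Euclidean distance from query point i to every generating point
    which.min((seed_x - query_x[i])^2 + (seed_y - query_y[i])^2)
  }, integer(1))
}

# Toy example: four generating points on the corners of the unit square
seed_x <- c(0, 1, 0, 1)
seed_y <- c(0, 0, 1, 1)
tile <- nearest_seed(seed_x, seed_y, c(0.9, 0.1), c(0.2, 0.8))
tile
```

A query point that lies exactly on a tile boundary is equidistant from several generating points; which.min then returns the first of them.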

How can I create a matrix of random numbers without replacement within a row, but with replacement allowed within a column, in R?

How can I create a matrix of random numbers in which no number is repeated within a row,
like this:
5 29 24 20 31 33
2 18 35 4 11 21
30 40 22 14 2 28
33 14 4 18 5 10
10 33 15 2 28 18
7 22 9 25 31 20
12 29 31 22 37 26
7 31 34 28 19 23
7 34 11 6 31 28
My code:
matrix(sample(1:42, 60, replace = FALSE), ncol = 6)
But I receive this error message:
Error in sample.int(length(x), size, replace, prob) : cannot take a
sample larger than the population when 'replace = FALSE'
I see why this fails: with only the numbers 1 to 42, sample() cannot draw the 60 values the matrix needs without replacement.

You cannot generate all 60 numbers with a single call to sample, because you want to allow a number to appear again in a different row. Therefore you have to draw one sample per row. @Jav provided very neat code to accomplish this in a comment on the question:
t(sapply(1:10, function(x) sample(1:42, 6, replace = FALSE)))

If you want a different sample in each row, replicate can help you, but replicate (like pretty much everything else in R) works column-wise, so you have to transpose the result:
t(replicate(10, sample(1:42, 6)))
replace = FALSE is the default, so I didn't include it.
After transposing, 10 becomes the number of rows and 6 becomes the number of columns.
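A quick way to confirm that the result has the intended shape and that no row contains a repeated number is sketched below. The seed and the variable name m are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed, only to make the example reproducible

# One sample of 6 distinct numbers per replicate; replicate() returns them as columns
m <- t(replicate(10, sample(1:42, 6)))

dim(m)
# anyDuplicated() returns 0 for a row without repeated values
row_has_duplicates <- apply(m, 1, anyDuplicated) > 0
any(row_has_duplicates)
```

Numbers may still repeat within a column, since each row is drawn independently.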

Ceil and floor values in R

I have a data.table of integers with values between 1 and 60.
My question is about flooring or ceiling any number to the following values: 12 18 24 30 36 ... 60.
For example, let's say my data.table contains the number 13. I want R to "transform" this number into 12 and 18 as 13 lies in between those numbers. Moreover, if I have 18 I want R to keep it at 18.
If my data.table contains the value 50, I want R to convert that number into 48 and 54 and so on.
My goal is to get two different data.tables. One where the floored values are saved and one where the ceiled values are saved.
Any idea how one could do this in R?
EDIT: Numbers smaller than 12 should always be transformed to 12.
Example output:
If I have the following data.table: data.table(c(1,28,29,41,53,53,17,41,41,53))
I want the following two output data.tables:
floored values: data.table(c(12,24,24,36,48,48,12,36,36,48))
ceiled values: data.table(c(12,30,30,42,54,54,18,42,42,54))

Here is a fairly direct way (edited to round up to 12 if any values are below):
df <- data.frame(nums = 10:20)
df$floors <- with(df,pmax(12,6*floor(nums/6)))
df$ceils <- with(df,pmax(12,6*ceiling(nums/6)))
Leading to:
> df
nums floors ceils
1 10 12 12
2 11 12 12
3 12 12 12
4 13 12 18
5 14 12 18
6 15 12 18
7 16 12 18
8 17 12 18
9 18 18 18
10 19 18 24
11 20 18 24
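As a check, the same two expressions can be applied to the example vector from the question. This sketch uses plain numeric vectors rather than a data.table to stay self-contained; the names nums, floors and ceils are only for illustration.

```r
# Example values taken from the question
nums <- c(1, 28, 29, 41, 53, 53, 17, 41, 41, 53)

# Round down / up to the nearest multiple of 6, never going below 12
floors <- pmax(12, 6 * floor(nums / 6))
ceils  <- pmax(12, 6 * ceiling(nums / 6))

floors  # 12 24 24 36 48 48 12 36 36 48
ceils   # 12 30 30 42 54 54 18 42 42 54
```

Both results match the expected output given in the question.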

Here's a way we could do this, using sapply and the which.min functions. From your question, it's not immediately clear how values < 12 should be handled.
x <- 1:60
num_list <- seq(12, 60, 6)
floorr <- sapply(x, function(x){
diff_vec <- x - num_list
diff_vec <- ifelse(diff_vec < 0, Inf, diff_vec)
num_list[which.min(diff_vec)]
})
ceill <- sapply(x, function(x){
diff_vec <- num_list - x
diff_vec <- ifelse(diff_vec < 0, Inf, diff_vec)
num_list[which.min(diff_vec)]
})
tail(cbind(x, floorr, ceill))
x floorr ceill
[55,] 55 54 60
[56,] 56 54 60
[57,] 57 54 60
[58,] 58 54 60
[59,] 59 54 60
[60,] 60 60 60
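The same lookup against num_list can also be written without sapply, using the vectorised findInterval. This is a sketch of an alternative, not part of the answer above: it assumes values below 12 should map to 12 (as the question's edit states) and relies on the left.open argument of findInterval to make the ceiling intervals closed on the right.

```r
num_list <- seq(12, 60, 6)
x <- c(1, 28, 29, 41, 53, 17, 18, 60)

# Floor: index of the largest grid value <= x (0 when x < 12, so clamp to 1)
floor_idx <- pmax(findInterval(x, num_list), 1)
floored <- num_list[floor_idx]

# Ceiling: with left.open = TRUE the intervals are (a, b], so a value equal to a
# grid value stays on it; clamp at the last grid value
ceil_idx <- pmin(findInterval(x, num_list, left.open = TRUE) + 1, length(num_list))
ceiled <- num_list[ceil_idx]

cbind(x, floored, ceiled)
```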

Aesthetics must be either length 1 or the same as the data: ymin, ymax, x, y, colour, when using a second geom_errorbar function

I'm trying to add error bars to a second curve (using dataset "pmfprofbs01"), but I'm having problems and I couldn't fix this.
There are a few threads on this error, but unfortunately it looks like every other answer is case specific, and I'm not able to overcome this error in my code. I am able to plot a first smoothed curve (stat_smooth) and overlapping error bars (using geom_errorbar). The problem arises when I try to add a second curve to the same graph, for comparison purposes.
With following code, I get the following error: "Error: Aesthetics must be either length 1 or the same as the data (35): ymin, ymax, x, y, colour"
I am looking to add additional errorbars to the second smoothed curve (corresponding to datasets pmfprof01 and pmfprofbs01).
Could someone explain why I keep getting this error? The code works until using the second call of geom_errorbar().
These are my 4 datasets (all used as data frames):
- pmfprof1 and pmfprof01 are the two datasets used for applying the smoothing method.
- pmfprofbs1 and pmfprofbs01 contain additional information based on an error analysis for plotting error bars.
> pmfprof1
Z correctedpmfprof1
1 -1.1023900 -8.025386e-22
2 -1.0570000 6.257110e-02
3 -1.0116000 1.251420e-01
4 -0.9662020 2.143170e-01
5 -0.9208040 3.300960e-01
6 -0.8754060 4.658550e-01
7 -0.8300090 6.113410e-01
8 -0.7846110 4.902430e-01
9 -0.7392140 3.344200e-01
10 -0.6938160 4.002040e-01
11 -0.6484190 1.215460e-01
12 -0.6030210 -1.724360e-01
13 -0.5576240 -6.077170e-01
14 -0.5122260 -1.513420e+00
15 -0.4668290 -2.075330e+00
16 -0.4214310 -2.617160e+00
17 -0.3760340 -3.350500e+00
18 -0.3306360 -4.076220e+00
19 -0.2852380 -4.926540e+00
20 -0.2398410 -5.826390e+00
21 -0.1944430 -6.761300e+00
22 -0.1490460 -7.301530e+00
23 -0.1036480 -7.303880e+00
24 -0.0582507 -7.026800e+00
25 -0.0128532 -6.627960e+00
26 0.0325444 -6.651490e+00
27 0.0779419 -6.919830e+00
28 0.1233390 -6.686490e+00
29 0.1687370 -6.129060e+00
30 0.2141350 -6.120890e+00
31 0.2595320 -6.455160e+00
32 0.3049300 -6.554560e+00
33 0.3503270 -6.983390e+00
34 0.3957250 -7.413500e+00
35 0.4411220 -6.697370e+00
36 0.4865200 -5.477230e+00
37 0.5319170 -4.552890e+00
38 0.5773150 -3.393060e+00
39 0.6227120 -2.449930e+00
40 0.6681100 -2.183190e+00
41 0.7135080 -1.673980e+00
42 0.7589050 -8.003740e-01
43 0.8043030 -2.918780e-01
44 0.8497000 -1.159710e-01
45 0.8950980 9.123767e-22
> pmfprof01
Z correctedpmfprof01
1 -1.25634000 -1.878749e-21
2 -1.20387000 -1.750190e-01
3 -1.15141000 -3.500380e-01
4 -1.09894000 -6.005650e-01
5 -1.04647000 -7.935110e-01
6 -0.99400600 -8.626150e-01
7 -0.94153900 -1.313880e+00
8 -0.88907200 -2.067770e+00
9 -0.83660500 -2.662440e+00
10 -0.78413800 -4.514190e+00
11 -0.73167100 -7.989510e+00
12 -0.67920400 -1.186870e+01
13 -0.62673800 -1.535970e+01
14 -0.57427100 -1.829150e+01
15 -0.52180400 -2.067170e+01
16 -0.46933700 -2.167890e+01
17 -0.41687000 -2.069820e+01
18 -0.36440300 -1.662640e+01
19 -0.31193600 -1.265950e+01
20 -0.25946900 -1.182580e+01
21 -0.20700200 -1.213370e+01
22 -0.15453500 -1.233680e+01
23 -0.10206800 -1.235160e+01
24 -0.04960160 -1.123630e+01
25 0.00286531 -9.086940e+00
26 0.05533220 -6.562710e+00
27 0.10779900 -4.185860e+00
28 0.16026600 -3.087430e+00
29 0.21273300 -2.005150e+00
30 0.26520000 -9.295540e-02
31 0.31766700 1.450360e+00
32 0.37013400 1.123910e+00
33 0.42260100 2.426750e-01
34 0.47506700 1.213370e-01
35 0.52753400 5.265226e-21
> pmfprofbs1
Z correctedpmfprof01 bsmean bssd bsse bsci
1 -1.1023900 -8.025386e-22 0.00000000 0.0000000 0.00000000 0.0000000
2 -1.0570000 6.257110e-02 1.46519200 0.6691245 0.09974719 0.2010273
3 -1.0116000 1.251420e-01 1.62453300 0.6368053 0.09492933 0.1913175
4 -0.9662020 2.143170e-01 1.62111600 0.7200497 0.10733867 0.2163269
5 -0.9208040 3.300960e-01 1.44754700 0.7236743 0.10787900 0.2174158
6 -0.8754060 4.658550e-01 1.67509800 0.7148755 0.10656735 0.2147724
7 -0.8300090 6.113410e-01 1.78144200 0.7374481 0.10993227 0.2215539
8 -0.7846110 4.902430e-01 1.73058700 0.7701354 0.11480501 0.2313743
9 -0.7392140 3.344200e-01 0.97430090 0.7809477 0.11641681 0.2346227
10 -0.6938160 4.002040e-01 1.26812000 0.8033838 0.11976139 0.2413632
11 -0.6484190 1.215460e-01 0.93601510 0.7927926 0.11818254 0.2381813
12 -0.6030210 -1.724360e-01 0.63201080 0.8210839 0.12239996 0.2466809
13 -0.5576240 -6.077170e-01 0.05952252 0.8653050 0.12899205 0.2599664
14 -0.5122260 -1.513420e+00 0.57893690 0.8858471 0.13205429 0.2661379
15 -0.4668290 -2.075330e+00 -0.08164613 0.8921298 0.13299086 0.2680255
16 -0.4214310 -2.617160e+00 -1.08074600 0.8906925 0.13277660 0.2675937
17 -0.3760340 -3.350500e+00 -1.67279700 0.9081813 0.13538367 0.2728479
18 -0.3306360 -4.076220e+00 -2.50074900 1.0641550 0.15863486 0.3197076
19 -0.2852380 -4.926540e+00 -3.12062200 1.0639080 0.15859804 0.3196333
20 -0.2398410 -5.826390e+00 -4.47060100 1.1320770 0.16876008 0.3401136
21 -0.1944430 -6.761300e+00 -5.40812700 1.1471780 0.17101120 0.3446504
22 -0.1490460 -7.301530e+00 -6.42419100 1.1685490 0.17419700 0.3510710
23 -0.1036480 -7.303880e+00 -5.79613500 1.1935850 0.17792915 0.3585926
24 -0.0582507 -7.026800e+00 -5.85496900 1.2117630 0.18063896 0.3640539
25 -0.0128532 -6.627960e+00 -6.70480400 1.1961400 0.17831002 0.3593602
26 0.0325444 -6.651490e+00 -8.27106200 1.3376870 0.19941060 0.4018857
27 0.0779419 -6.919830e+00 -8.79402900 1.3582760 0.20247983 0.4080713
28 0.1233390 -6.686490e+00 -8.35947700 1.3673080 0.20382624 0.4107848
29 0.1687370 -6.129060e+00 -8.04437600 1.3921620 0.20753126 0.4182518
30 0.2141350 -6.120890e+00 -8.18588300 1.5220550 0.22689456 0.4572759
31 0.2595320 -6.455160e+00 -8.37217600 1.5436800 0.23011823 0.4637728
32 0.3049300 -6.554560e+00 -8.59346400 1.6276880 0.24264140 0.4890116
33 0.3503270 -6.983390e+00 -8.88378700 1.6557140 0.24681927 0.4974316
34 0.3957250 -7.413500e+00 -9.72709800 1.6569390 0.24700188 0.4977996
35 0.4411220 -6.697370e+00 -9.46033400 1.6378470 0.24415582 0.4920637
36 0.4865200 -5.477230e+00 -8.37590600 1.6262700 0.24243002 0.4885856
37 0.5319170 -4.552890e+00 -7.52867000 1.6617010 0.24771176 0.4992302
38 0.5773150 -3.393060e+00 -6.89192300 1.6667330 0.24846189 0.5007420
39 0.6227120 -2.449930e+00 -6.25115300 1.6670390 0.24850750 0.5008340
40 0.6681100 -2.183190e+00 -6.05373800 1.6720180 0.24924973 0.5023298
41 0.7135080 -1.673980e+00 -5.10526700 1.6668400 0.24847784 0.5007742
42 0.7589050 -8.003740e-01 -4.42001600 1.6561830 0.24688918 0.4975725
43 0.8043030 -2.918780e-01 -4.26640200 1.6588970 0.24729376 0.4983878
44 0.8497000 -1.159710e-01 -4.46318500 1.6533830 0.24647179 0.4967312
45 0.8950980 9.123767e-22 -5.17173200 1.6557990 0.24683194 0.4974571
> pmfprofbs01
Z correctedpmfprof01 bsmean bssd bsse bsci
1 -1.25634000 -1.878749e-21 0.000000 0.0000000 0.00000000 0.0000000
2 -1.20387000 -1.750190e-01 2.316589 0.4646486 0.07853995 0.1596124
3 -1.15141000 -3.500380e-01 2.320647 0.4619668 0.07808664 0.1586911
4 -1.09894000 -6.005650e-01 2.635883 0.6519826 0.11020517 0.2239639
5 -1.04647000 -7.935110e-01 2.814679 0.6789875 0.11476983 0.2332404
6 -0.99400600 -8.626150e-01 2.588038 0.7324196 0.12380151 0.2515949
7 -0.94153900 -1.313880e+00 2.033736 0.7635401 0.12906183 0.2622852
8 -0.88907200 -2.067770e+00 2.394285 0.8120181 0.13725611 0.2789380
9 -0.83660500 -2.662440e+00 2.465425 0.9485307 0.16033095 0.3258317
10 -0.78413800 -4.514190e+00 0.998115 1.0177400 0.17202946 0.3496059
11 -0.73167100 -7.989510e+00 -1.585430 1.0502190 0.17751941 0.3607628
12 -0.67920400 -1.186870e+01 -5.740894 1.2281430 0.20759406 0.4218819
13 -0.62673800 -1.535970e+01 -9.325951 1.3289330 0.22463068 0.4565045
14 -0.57427100 -1.829150e+01 -12.010540 1.3279860 0.22447060 0.4561792
15 -0.52180400 -2.067170e+01 -14.672770 1.3296720 0.22475559 0.4567583
16 -0.46933700 -2.167890e+01 -14.912250 1.3192610 0.22299581 0.4531820
17 -0.41687000 -2.069820e+01 -12.850570 1.3288470 0.22461614 0.4564749
18 -0.36440300 -1.662640e+01 -6.093746 1.3497100 0.22814263 0.4636416
19 -0.31193600 -1.265950e+01 -5.210692 1.3602240 0.22991982 0.4672533
20 -0.25946900 -1.182580e+01 -6.041660 1.3818700 0.23357866 0.4746890
21 -0.20700200 -1.213370e+01 -5.765808 1.3854680 0.23418683 0.4759249
22 -0.15453500 -1.233680e+01 -6.985883 1.4025360 0.23707185 0.4817880
23 -0.10206800 -1.235160e+01 -7.152865 1.4224030 0.24042999 0.4886125
24 -0.04960160 -1.123630e+01 -3.600538 1.4122650 0.23871635 0.4851300
25 0.00286531 -9.086940e+00 -0.751673 1.5764920 0.26647578 0.5415439
26 0.05533220 -6.562710e+00 2.852910 1.5535620 0.26259991 0.5336672
27 0.10779900 -4.185860e+00 5.398850 1.5915640 0.26902342 0.5467214
28 0.16026600 -3.087430e+00 6.262459 1.6137360 0.27277117 0.5543377
29 0.21273300 -2.005150e+00 8.047920 1.6283340 0.27523868 0.5593523
30 0.26520000 -9.295540e-02 11.168640 1.6267620 0.27497297 0.5588123
31 0.31766700 1.450360e+00 12.345900 1.6363310 0.27659042 0.5620994
32 0.37013400 1.123910e+00 12.124650 1.6289230 0.27533824 0.5595546
33 0.42260100 2.426750e-01 11.279890 1.6137100 0.27276677 0.5543288
34 0.47506700 1.213370e-01 11.531670 1.6311490 0.27571450 0.5603193
35 0.52753400 5.265226e-21 11.284980 1.6662890 0.28165425 0.5723903
The code for plotting both curves is:
deltamean01<-pmfprofbs01[,"bsmean"]-
pmfprofbs01[,"correctedpmfprof01"]
correctmean01<-pmfprofbs01[,"bsmean"]-deltamean01
deltamean1<-pmfprofbs1[,"bsmean"]-
pmfprofbs1[,"correctedpmfprof1"]
correctmean1<-pmfprofbs1[,"bsmean"]-deltamean1
pl<- ggplot(pmfprof1, aes(x=pmfprof1[,1], y=pmfprof1[,2],
colour="red")) +
list(
stat_smooth(method = "gam", formula = y ~ s(x), size = 1,
colour="chartreuse3",fill="chartreuse3", alpha = 0.3),
geom_line(data=pmfprof1,linetype=4, size=0.5,colour="chartreuse3"),
geom_errorbar(aes(ymin=correctmean1-pmfprofbs1[,"bsci"],
ymax=correctmean1+pmfprofbs1[,"bsci"]),
data=pmfprofbs1,colour="chartreuse3",
width=0.02,size=0.9),
geom_point(data=pmfprof1,size=1,colour="chartreuse3"),
xlab(expression(xi*(nm))),
ylab("PMF (KJ/mol)"),
## GCD
geom_errorbar(aes(ymin=correctmean01-pmfprofbs01[,"bsci"],
ymax=correctmean01+pmfprofbs01[,"bsci"]),
data=pmfprofbs01,
width=0.02,size=0.9),
geom_line(data=pmfprof01,aes(x=pmfprof01[,1],y=pmfprof01[,2]),
linetype=4, size=0.5,colour="darkgreen"),
stat_smooth(data=pmfprof01,method = "gam",aes(x=pmfprof01[,1],pmfprof01[,2]),
formula = y ~ s(x), size = 1,
colour="darkgreen",fill="darkgreen", alpha = 0.3),
theme(text = element_text(size=20),
axis.text.x = element_text(size=20,colour="black"),
axis.text.y = element_text(size=20,colour="black")),
scale_x_continuous(breaks=number_ticks(8)),
scale_y_continuous(breaks=number_ticks(8)),
theme(panel.background = element_rect(fill ='white',
colour='gray')),
theme(plot.background = element_rect(fill='white',
colour='white')),
theme(legend.position="none"),
theme(legend.key = element_blank()),
theme(legend.title = element_text(colour='gray', size=20)),
NULL
)
pl
This is the result of using pl,
[enter image description here][1]
[1]: https://i.stack.imgur.com/x8FjY.png
Thanks in advance for any suggestion,

Performance of calculating the distance between two positions on a tree?

Here is a tree. The first column is an identifier for the branch, where 0 is the trunk, L is the first branch on the left and R is the first branch on the right. LL is the branch on the extreme left after the second bifurcation, etc. The variable length contains the length of each branch.
> tree
branch length
1 0 20
2 L 12
3 LL 19
4 R 19
5 RL 12
6 RLL 10
7 RLR 12
8 RR 17
tree = data.frame(branch = c("0","L", "LL", "R", "RL", "RLL", "RLR", "RR"), length=c(20,12,19,19,12,10,12,17))
tree$branch = as.character(tree$branch)
and here is a drawing of this tree
Here are two positions on this tree
posA = tree[4,]; posA$length = 12
posB = tree[6,]; posB$length = 3
The positions are given by the branch ID and the distance (variable length) to the origin of the branch (more info in edits).
I wrote the following messy distance function to calculate the shortest distance along the branches between any two points on the tree. The shortest distance along the branches can be understood as the minimal distance an ant would need to walk along the branches to reach one position from the other position.
distance = function(tree, pos1, pos2){
if (identical(pos1$branch, pos2$branch)){Dist=pos1$length-pos2$length;return(Dist)}
pos1path = strsplit(pos1$branch, "")[[1]]
if (pos1path[1]!="0") {pos1path = c("0", pos1path)}
pos2path = strsplit(pos2$branch, "")[[1]]
if (pos2path[1]!="0") {pos2path = c("0", pos2path)}
loop = 1:min(length(pos1path), length(pos2path))
loop = loop[-which(loop == 1)]
CommonTrace="included"; for (i in loop) {
if (pos1path[i] != pos2path[i]) {
CommonTrace = i-1; break
}
}
if(CommonTrace=="included"){
CommonTrace = min(length(pos1path), length(pos2path))
if (length(pos1path) > length(pos2path)) {
longerpos = pos1; shorterpos = pos2; longerpospath = pos1path
} else {
longerpos = pos2; shorterpos = pos1; longerpospath = pos2path
}
distToNode = 0
if ((CommonTrace+1) != length(longerpospath)){
for (i in (CommonTrace+1):(length(longerpospath)-1)){
distToNode = distToNode + tree$length[tree$branch == paste0(longerpospath[2:i], collapse='')]
}
}
Dist = distToNode + longerpos$length + (tree[tree$branch == shorterpos$branch,]$length-shorterpos$length)
if (identical(shorterpos, pos1)){Dist=-Dist}
return(Dist)
} else { # if they are sisterbranch
Dist=0
if((CommonTrace+1) != length(pos1path)){
for (i in (CommonTrace+1):(length(pos1path)-1)){
Dist = Dist + tree$length[tree$branch == paste0(pos1path[2:i], collapse='')]
}
}
if((CommonTrace+1) != length(pos2path)){
for (i in (CommonTrace+1):(length(pos2path)-1)){
Dist = Dist + tree$length[tree$branch == paste(pos2path[2:i], collapse='')]
}
}
Dist = Dist + pos1$length + pos2$length
return(Dist)
}
}
I think the algorithm works fine but it is not very efficient. Note that the sign of the distance is important. The sign only makes sense when the two positions are not on "sister branches", that is, when one of the two positions lies on the way between the root and the other position.
distance(tree, posA, posB) # -22
I then just loop through all positions of interest like this:
allpositions=rbind(tree, tree)
allpositions$length = c(1,5,8,2,2,3,5,6,7,8,2,3,1,2,5,6)
mat = matrix(-1, ncol=nrow(allpositions), nrow=nrow(allpositions))
for (i in 1:nrow(allpositions)){
for (j in 1:nrow(allpositions)){
posA = allpositions[i,]
posB = allpositions[j,]
mat[i,j] = distance(tree, posA, posB)
}
}
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
# 1 0 -24 -39 -21 -40 -53 -55 -44 -6 -27 -33 -22 -39 -52 -55 -44
# 2 24 0 -15 7 26 39 41 30 18 -3 -9 8 25 38 41 30
# 3 39 15 0 22 41 54 56 45 33 12 6 23 40 53 56 45
# 4 21 7 22 0 -19 -32 -34 -23 15 10 16 -1 -18 -31 -34 -23
# 5 40 26 41 19 0 -13 -15 8 34 29 35 18 1 -12 -15 8
# 6 53 39 54 32 13 0 8 21 47 42 48 31 14 1 8 21
# 7 55 41 56 34 15 8 0 23 49 44 50 33 16 7 0 23
# 8 44 30 45 23 8 21 23 0 38 33 39 22 7 20 23 0
# 9 6 -18 -33 -15 -34 -47 -49 -38 0 -21 -27 -16 -33 -46 -49 -38
# 10 27 3 -12 10 29 42 44 33 21 0 -6 11 28 41 44 33
# 11 33 9 -6 16 35 48 50 39 27 6 0 17 34 47 50 39
# 12 22 8 23 1 -18 -31 -33 -22 16 11 17 0 -17 -30 -33 -22
# 13 39 25 40 18 -1 -14 -16 7 33 28 34 17 0 -13 -16 7
# 14 52 38 53 31 12 -1 7 20 46 41 47 30 13 0 7 20
# 15 55 41 56 34 15 8 0 23 49 44 50 33 16 7 0 23
# 16 44 30 45 23 8 21 23 0 38 33 39 22 7 20 23 0
As an example, let's consider the first and the third positions in the object allpositions. The distance between them is 39 (and -39) because an ant would need to walk 19 units on branch 0 and then walk 12 units on branch L and finally the ant would need to walk 8 units on branch LL. 19 + 12 + 8 = 39
The issue is that I have about 20 very big trees with about 50000 positions and I would like to calculate the distance between any two positions. There are therefore 20 * 50000^2 distances to compute. It takes forever! Can you help me to improve my code?
EDIT
Please let me know if anything is still unclear
tree is a description of a tree. The tree has branches of a certain length. The name of the branches (variable: branch) gives indication about the relationship between the branches. The branch RL is a "parent branch" of the two branches RLL and RLR, where R and L stand for right and left.
allpositions is a data.frame, where each line represents one independent position on the tree. You can think of the position of a squirrel. The position is defined by two pieces of information: 1) the branch (variable: branch) on which the squirrel is standing and 2) the distance between the beginning of the branch and the position of the squirrel (variable: length).
Three examples
Consider a first squirrel that is at position (variable: length) 8 on the branch RL (whose length is 12) and a second squirrel that is at position (variable: length) 2 on the branch RLL or RLR. The distance between the two squirrels is 12 - 8 + 2 = 6 (or -6).
Consider a first squirrel that is at position (variable: length) 8 on the branch RL and a second squirrel that is at position (variable: length) 2 on the branch RR. The distance between the two squirrels is 8 + 2 = 10 (or -10).
Consider a first squirrel that is at position (variable: length) 8 on the branch R (whose length is 19) and a second squirrel that is at position (variable: length) 2 on the branch RLL. Knowing that the branch RL has a length of 12, the distance between the two squirrels is 19 - 8 + 12 + 2 = 25 (or -25).

The code below uses the igraph package to compute the distances between positions in tree and seems noticeably faster than the code you posted in your question. The approach is to create graph vertices at the branch intersections and at the positions along the branches given in allpositions. Graph edges are the branch segments between these vertices. igraph builds a graph from tree and allpositions and then finds the distances between the vertices corresponding to the allpositions data.
t.graph <- function(tree, positions) {
library(igraph)
# Assign vertex name to tree branch intersections
n_label <- nchar(tree$branch)
tree$high_vert <- tree$branch
tree$low_vert <- tree$branch
tree$brnch_type <- "tree"
for( i in 1:nrow(tree) ) {
tree$low_vert[i] <- if(n_label[i] > 1) substr(tree$branch[i], 1, n_label[i]-1)
else { if(tree$branch[i] %in% c("R","L")) "0"
else "root" }
}
# combine position data with tree data
positions$brnch_type <- "position"
temp <- merge(positions, tree, by = "branch")
positions <- temp[, c("branch","length.x","high_vert","low_vert","brnch_type.x")]
positions$high_vert <- paste(positions$branch, positions$length.x, sep="_")
colnames(positions) <- c("branch","length","high_vert","low_vert","brnch_type")
tree <- rbind(tree, positions)
# use positions to segment tree branches
tree_brnch <- split(tree, tree$branch)
tree <- data.frame( branch=NA_character_, length = NA_real_, high_vert = NA_character_,
low_vert = NA_character_, brnch_type =NA_character_, seg_len= NA_real_)
for( ib in 1: length(tree_brnch)) {
brnch_seg <- tree_brnch[[ib]][order(tree_brnch[[ib]]$length, decreasing=TRUE), ]
n_seg <- nrow(brnch_seg)
brnch_seg$seg_len <- brnch_seg$length
for( is in 1:(n_seg-1) ) {
brnch_seg$seg_len[is] <- brnch_seg$length[is] - brnch_seg$length[is+1]
brnch_seg$low_vert[is] <- brnch_seg$high_vert[is+1]
}
tree <- rbind(tree, brnch_seg)
}
tree <- tree[-1,]
# Create graph of tree and positions
tree_graph <- graph.data.frame(tree[,c("low_vert","high_vert")])
E(tree_graph)$label <- tree$high_vert
E(tree_graph)$brnch_type <- tree$brnch_type
E(tree_graph)$weight <- tree$seg_len
# calculate shortest distances between position vertices
position_verts <- V(tree_graph)[grep("_", V(tree_graph)$name)]
vert_dist <- shortest.paths(tree_graph, v=position_verts, to=position_verts, mode="all")
return(dist_mat= vert_dist )
}
I've benchmarked the igraph code (the t.graph function) against the code posted in your question, wrapping your loop over allpositions with your distance function in a function named Remi. Sample trees were created as extensions of your tree and allpositions data, with 64, 256 and 2048 branches and twice as many positions. Execution times are compared below. Note that times are in milliseconds.
microbenchmark(matR16 <- Remi(tree, allpositions), matG16 <- t.graph(tree, allpositions),
matR256 <- Remi(tree256, allpositions256), matG256 <- t.graph(tree256, allpositions256), times=2)
Unit: milliseconds
expr min lq mean median uq max neval
matR8 <- Remi(tree, allpositions) 58.82173 58.82173 59.92444 59.92444 61.02714 61.02714 2
matG8 <- t.graph(tree, allpositions) 11.82064 11.82064 13.15275 13.15275 14.48486 14.48486 2
matR256 <- Remi(tree256, allpositions256) 114795.50865 114795.50865 114838.99490 114838.99490 114882.48114 114882.48114 2
matG256 <- t.graph(tree256, allpositions256) 379.54559 379.54559 379.76673 379.76673 379.98787 379.98787 2
Compared to the code you posted, the igraph results are only about 5 times faster for the 8 branch case but are over 300 times faster for 256 branches so igraph seems to scale better with size. I've also benchmarked the igraph code for the 2048 branch case with the following results. Again times are in milliseconds.
microbenchmark(matG8 <- t.graph(tree, allpositions), matG64 <- t.graph(tree64, allpositions64),
matG256 <- t.graph(tree256, allpositions256), matG2k <- t.graph(tree2k, allpositions2k), times=2)
Unit: milliseconds
expr min lq mean median uq max neval
matG8 <- t.graph(tree, allpositions) 11.78072 11.78072 12.00599 12.00599 12.23126 12.23126 2
matG64 <- t.graph(tree64, allpositions64) 73.29006 73.29006 73.49409 73.49409 73.69812 73.69812 2
matG256 <- t.graph(tree256, allpositions256) 377.21756 377.21756 410.01268 410.01268 442.80780 442.80780 2
matG2k <- t.graph(tree2k, allpositions2k) 11311.05758 11311.05758 11362.93701 11362.93701 11414.81645 11414.81645 2
so the distance matrix for about 4000 positions is calculated in less than 12 seconds.
t.graph returns the distance matrix with rows and columns labeled by branch name and position on the branch, joined by an underscore, so for example
0_7 0_1 L_8 L_5 LL_8 LL_2 R_3 R_2 RL_2 RL_1 RLL_3 RLL_2 RLR_5 RR_6
L_5 18 24 3 0 15 9 8 7 26 25 39 38 41 30
shows the distances from L_5, the position 5 units along the L branch, to the other positions.
I don't know whether this will handle your largest cases, but it may be helpful for some. Storage will also be a problem for your largest cases: a 50000 x 50000 matrix of doubles takes about 20 GB on its own.
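Individual entries of the distance matrix can be cross-checked in base R with the arithmetic from the three worked examples in the question: each position gets a depth (distance from the base of the trunk), and two positions are either on the same root-to-tip path or on paths that split at the end of their last common branch. The sketch below makes no speed claim. The function tree_distance and its argument names are made up for illustration, it returns unsigned distances, and it assumes the branch naming scheme of the question (0 for the trunk, strings of L and R otherwise).

```r
tree <- data.frame(branch = c("0", "L", "LL", "R", "RL", "RLL", "RLR", "RR"),
                   length = c(20, 12, 19, 19, 12, 10, 12, 17),
                   stringsAsFactors = FALSE)

# Hypothetical helper: unsigned distance along the branches between two positions
tree_distance <- function(tree, branch1, pos1, branch2, pos2) {
  # Prefix non-trunk branches with "0" so the trunk is a prefix of every path
  full <- function(b) ifelse(b == "0", "0", paste0("0", b))
  len <- setNames(tree$length, full(tree$branch))

  # Depth at which a branch starts: summed length of all its ancestor branches
  start_depth <- function(p) {
    n <- nchar(p)
    if (n == 1) return(0)
    sum(len[substring(p, 1, 1:(n - 1))])
  }

  p1 <- full(branch1)
  p2 <- full(branch2)
  d1 <- start_depth(p1) + pos1
  d2 <- start_depth(p2) + pos2

  # Length of the longest common prefix of the two paths
  c1 <- strsplit(p1, "")[[1]]
  c2 <- strsplit(p2, "")[[1]]
  m <- min(length(c1), length(c2))
  k <- sum(cumprod(c1[1:m] == c2[1:m]))

  if (k == m) {
    # Same branch, or one branch is an ancestor of the other
    abs(d1 - d2)
  } else {
    # The paths split at the end of the last common branch
    common <- substr(p1, 1, k)
    split_depth <- start_depth(common) + len[[common]]
    (d1 - split_depth) + (d2 - split_depth)
  }
}

tree_distance(tree, "R", 12, "RLL", 3)  # posA and posB from the question: 22
tree_distance(tree, "0", 1, "LL", 8)    # first and third rows of allpositions: 39
tree_distance(tree, "RL", 8, "RR", 2)   # second squirrel example: 10
```

The sign convention of the original distance function is deliberately left out here, so results should be compared with the absolute values in the matrix from the question.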