I'm afraid I'm missing something obvious, but I just can't see what I'm doing wrong.
If anyone can help me spot it, that would be great.
Here's the full, symmetrical distance matrix I'm starting from:
d2 <- structure(list(P1 = c(0, 0.1, 0.3, 0.2, 0, 0.1), P2 = c(0.1,
0, 0.5, 0.7, 1, 0.9), P3 = c(0.3, 0.5, 0, 1, 0.2, 0.3), P4 = c(0.2,
0.7, 1, 0, 0.2, 0.5), P5 = c(0, 1, 0.2, 0.2, 0, 0.7), P6 = c(0.1,
0.9, 0.3, 0.5, 0.7, 0)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -6L))
sum(abs(d2-t(d2)))
#[1] 0
I want to generate coordinates for the corresponding 6 points, so that the (Euclidean) distance matrix resulting from those coordinates is as close as possible to my d2.
From the cmdscale documentation:
A set of Euclidean distances on n points can be represented exactly in at most n - 1 dimensions.
I would have thought (n-1)/2 dimensions would suffice, and indeed, when I run cmdscale, if I go anywhere higher than k=3 I get something close to 0 for the higher coordinates, or even a warning:
cmdscale(d2,k=3)
# [,1] [,2] [,3]
#[1,] -0.03526127 0.07755701 1.708755e-05
#[2,] -0.50626939 0.31256816 -5.646907e-02
#[3,] -0.26333957 -0.40518119 -6.978213e-02
#[4,] 0.35902238 0.37455879 2.148406e-02
#[5,] 0.33997864 -0.17998635 -2.809260e-01
#[6,] 0.10586921 -0.17951643 3.856760e-01
cmdscale(d2,k=4)
# [,1] [,2] [,3] [,4]
#[1,] -0.03526127 0.07755701 1.708755e-05 -7.450581e-09
#[2,] -0.50626939 0.31256816 -5.646907e-02 -7.450581e-09
#[3,] -0.26333957 -0.40518119 -6.978213e-02 -7.450581e-09
#[4,] 0.35902238 0.37455879 2.148406e-02 -7.450581e-09
#[5,] 0.33997864 -0.17998635 -2.809260e-01 -7.450581e-09
#[6,] 0.10586921 -0.17951643 3.856760e-01 -7.450581e-09
cmdscale(d2,k=5)
# [,1] [,2] [,3] [,4]
#[1,] -0.03526127 0.07755701 1.708755e-05 -7.450581e-09
#[2,] -0.50626939 0.31256816 -5.646907e-02 -7.450581e-09
#[3,] -0.26333957 -0.40518119 -6.978213e-02 -7.450581e-09
#[4,] 0.35902238 0.37455879 2.148406e-02 -7.450581e-09
#[5,] 0.33997864 -0.17998635 -2.809260e-01 -7.450581e-09
#[6,] 0.10586921 -0.17951643 3.856760e-01 -7.450581e-09
#Warning message:
#In cmdscale(d2, k = 5) : only 4 of the first 5 eigenvalues are > 0
So, assuming that k=3 is sufficient, this is what happens when I try to reverse the operation:
dd <- dist(cmdscale(d2,k=3),diag = T,upper = T)
dd
# 1 2 3 4 5 6
#1 0.0000000 0.5294049 0.5384495 0.4940956 0.5348482 0.4844970
#2 0.5294049 0.0000000 0.7578630 0.8710048 1.0045529 0.9013064
#3 0.5384495 0.7578630 0.0000000 1.0018275 0.6777074 0.6282371
#4 0.4940956 0.8710048 1.0018275 0.0000000 0.6319294 0.7097335
#5 0.5348482 1.0045529 0.6777074 0.6319294 0.0000000 0.7065166
#6 0.4844970 0.9013064 0.6282371 0.7097335 0.7065166 0.0000000
Which is quite different from what I expected:
as.matrix(dd)-d2
# P1 P2 P3 P4 P5 P6
#1 0.0000000 0.429404930 0.238449457 0.294095619 0.534848178 0.384497043
#2 0.4294049 0.000000000 0.257862963 0.171004810 0.004552925 0.001306386
#3 0.2384495 0.257862963 0.000000000 0.001827507 0.477707386 0.328237091
#4 0.2940956 0.171004810 0.001827507 0.000000000 0.431929428 0.209733518
#5 0.5348482 0.004552925 0.477707386 0.431929428 0.000000000 0.006516573
#6 0.3844970 0.001306386 0.328237091 0.209733518 0.006516573 0.000000000
sum(abs(as.matrix(dd)-d2))
#[1] 7.543948
Has anyone got any idea why the two distance matrices don't match at all?
I could try building my own least-squares problem to find the coordinates, but first I need to understand if I'm doing something wrong with these out-of-the-box R functions.
Thanks!
EDIT: possible inconsistency found in the data
Could the issue be that according to d2 points 1 and 5 coincide (they have distance 0):
as.matrix(d2)
# P1 P2 P3 P4 P5 P6
#[1,] 0.0 0.1 0.3 0.2 0.0 0.1
#[2,] 0.1 0.0 0.5 0.7 1.0 0.9
#[3,] 0.3 0.5 0.0 1.0 0.2 0.3
#[4,] 0.2 0.7 1.0 0.0 0.2 0.5
#[5,] 0.0 1.0 0.2 0.2 0.0 0.7
#[6,] 0.1 0.9 0.3 0.5 0.7 0.0
but then these two points have different distances from other points, e.g. d(1-2) is 0.1 whereas d(5-2) is 1?
Replacing the two 0's does not seem to help though:
d3 <- d2
d3[1,5] <- 0.2
d3[5,1] <- 0.2
dd3 <- cmdscale(as.matrix(d3),k=3)
sum(abs(as.matrix(dist(dd3))-as.matrix(d3)))
#[1] 7.168348
Does this perhaps indicate that not all distance matrices can be reduced to a completely consistent set of points, regardless of how many dimensions one uses?
EDIT 2: a possible answer to the last question.
I suspect that the answer is yes. And I was wrong about the number of dimensions: I see now why you need N-1 rather than half that.
If I have a distance d(A-B) = 1, I can represent that in 2-1 = 1 dimensions (the x axis), i.e. on a line, placing A at (xA=0) and B at (xB=1).
Then I introduce a third point C and I state that d(A-C) = 2.
I have 3 points, so I need 3-1 = 2 dimensions (xy plane).
The constraint given by d(A-C) is:
(xC - 0)^2 + (yC - 0)^2 = d(A-C)^2 = 4.
i.e. C can be anywhere on a circle of radius 2 centred at A.
This constrains both xC and yC to be in [-2,2].
However, previously I had not considered that this constrains the possible values of d(B-C), too, because:
d(B-C)^2 = (xC - 1)^2 + (yC - 0)^2
thus, by substitution of the (yC - 0)^2 term:
d(B-C)^2 = (xC - 1)^2 + 4 - (xC - 0)^2 = -2*xC + 5
d(B-C)^2 is therefore bound to [-2*(+2)+5,-2*(-2)+5] = [1,9].
So if my distance matrix contained d(A-B) = 1, d(A-C) = 2 and d(B-C) anywhere outside [1,3], it would describe a system that does not correspond to 3 points in Euclidean space. (In hindsight, this is just the triangle inequality: |d(A-C) - d(A-B)| <= d(B-C) <= d(A-C) + d(A-B).)
At least, I hope this makes sense.
So I guess my original question must be withdrawn.
I thought I'd leave the reasoning here for future reference or if anyone else should have the same doubt.
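A quick numerical check seems to confirm this: classical MDS works on the doubly centred matrix of squared distances, and negative eigenvalues of that matrix mean that no Euclidean configuration, in any number of dimensions, can reproduce d2 exactly. cmdscale reports the eigenvalues when called with eig = TRUE:
cmdscale(as.matrix(d2), eig = TRUE)$eig
# clearly negative values here would confirm the inconsistency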
Multidimensional scaling creates coordinates for the specified number of dimensions such that they will represent the distances in the original matrix as closely as possible. But the distances will be at different scales. In your example, d3 is the original distance matrix, dd3 is the matrix of coordinates, and dist(dd3) is the distance matrix from the reconstructed coordinates. The values are different, but they reflect the same relationships between points:
d3.v <- as.vector(as.dist(d3)) # Vector of original distances
dd3.v <- as.vector(dist(dd3)) # Vector of distances computed from coordinates
cor(d3.v, dd3.v)
# [1] 0.9433903
plot(d3.v, dd3.v, pch=16)
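As a small follow-up sketch (reusing the d3.v and dd3.v vectors above): if the reconstructed distances really differ from the originals mainly by a common scale, a no-intercept least-squares fit estimates that scale factor:
coef(lm(dd3.v ~ 0 + d3.v))
# the fitted coefficient is the single multiplier that best maps d3 onto dist(dd3)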
Related
I have a matrix with 50 rows and 50 columns:
[,1] [,2] [,3]...[,50]
[1,] 1 0.8 0.7
[2,] 0.8 1 0.5
[3,] 0.7 0.5 1
...
[50,]
And I want to add 0.02 to the values above the diagonal, to obtain something like this:
[,1] [,2] [,3]...[,50]
[1,] 1 0.82 0.72
[2,] 0.8 1 0.52
[3,] 0.7 0.5 1
...
[50,]
Does anyone know how this addition could be done only on the values above the diagonal of the matrix, using R?
Example of matrix code:
matrix <- as.matrix(data.frame(A = c(1, 0.8, 0.7), B = c(0.8, 1, 0.5), C = c(0.7, 0.5, 1)))
Try upper.tri like below
matrix[upper.tri(matrix)] <- matrix[upper.tri(matrix)] + 0.02
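Applied to the 3x3 example matrix above, this produces the desired result:
matrix
#     A    B    C
# 1 1.0 0.82 0.72
# 2 0.8 1.00 0.52
# 3 0.7 0.50 1.00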
You can use the lower.tri(m) or upper.tri(m) functions in R, where m is your matrix.
m = matrix(1:36, 6, 6)
m[upper.tri(m)] = m[upper.tri(m)] + 0.02
m
I have a matrix dsts with 3 columns; the third is a factor. I want my line plot to be coloured by the factor, but this command is not working:
plot(dsts[ ,'x'],dsts[,'dist'],col=dsts[,'i'],type='l')
and,
plot(dsts[ ,'x'],dsts[,'dist'],col=dsts[,'i'],type='n')
lines(dsts[ ,'x'],dsts[,'dist'],col=dsts[,'i'])
is not working either!
I want to avoid using matplot, which accepts matrices.
The col option, though able to take vector input, only effectively controls point colour, not line colour, so type = "p" works but type = "l" does not. With type = "b", only the points get the correct colours.
If you want to have several lines with different colours, you have to plot them with separate plot or lines calls. A better way to go is to reshape your data, then use matplot. It takes a matrix and plots its columns one by one via a for loop.
Since you've already got a function to reshape the data, you're on the right track.
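For instance, here is a minimal self-contained matplot sketch (with made-up data: three chi-square densities, one column per line):
x <- seq(0, 10, length.out = 100)
y <- sapply(1:3, function(k) dchisq(x, df = k))  # one column per curve
matplot(x, y, type = "l", lty = 1, col = 1:3)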
The reason that plot and lines ignore all but the first element of a vector col for line display is that they have no idea whether this vector has a reasonable, non-random pattern. They do something safe, using only col[1]. I will elaborate on this in two steps.
Firstly, consider this example to see that plot will always use col[1] when type = "l":
set.seed(0); mat1 <- round(cbind(rnorm(9),rnorm(9),rep(1:3, each = 3)), 1)
# [,1] [,2] [,3]
# [1,] 1.3 2.4 1
# [2,] -0.3 0.8 1
# [3,] 1.3 -0.8 1
# [4,] 1.3 -1.1 2
# [5,] 0.4 -0.3 2
# [6,] -1.5 -0.3 2
# [7,] -0.9 -0.4 3
# [8,] -0.3 0.3 3
# [9,] 0.0 -0.9 3
Then we reorder the rows of mat1:
mat2 <- mat1[c(4:9,1:3), ]
# [,1] [,2] [,3]
# [1,] 1.3 -1.1 2
# [2,] 0.4 -0.3 2
# [3,] -1.5 -0.3 2
# [4,] -0.9 -0.4 3
# [5,] -0.3 0.3 3
# [6,] 0.0 -0.9 3
# [7,] 1.3 2.4 1
# [8,] -0.3 0.8 1
# [9,] 1.3 -0.8 1
We use the 3rd column for col; now compare:
par(mfrow = c(1,2))
plot(mat1[,1], mat1[,2], col = mat1[,3], type = "l")
plot(mat2[,1], mat2[,2], col = mat2[,3], type = "l")
mat1[, 3] starts with 1, so the line colour is black; mat2[,3] starts with 2, so the line colour is red.
Now it is time to say why plot and lines ignore a vector col when type = "l". Consider a random row shuffle of mat1:
set.seed(0); mat3 <- mat1[sample(9), ]
# [,1] [,2] [,3]
# [1,] 0.0 -0.9 3
# [2,] 1.3 -0.8 1
# [3,] -0.3 0.3 3
# [4,] 1.3 -1.1 2
# [5,] 0.4 -0.3 2
# [6,] 1.3 2.4 1
# [7,] -0.9 -0.4 3
# [8,] -0.3 0.8 1
# [9,] -1.5 -0.3 2
plot(..., type = "l") joins the points up one by one. Be aware that a line of a single colour can only be drawn if the data points along its path share the same colour specification. Here, the 3rd column is completely random: there is no way to join the points up under such a colour specification.
The best and safest assumption plot and lines can make is that the col vector is completely random. Thus they retain only col[1], producing a single-colour plot. The full vector is used only when type = "p".
Note that the same logic applies to lwd and lty, too: any argument associated with line display takes only the first vector element. As I said earlier, if you do want to draw several different lines in different styles, do them one by one.
On top of Zheyuan Li's valuable insight into the problem at hand, I wrote a simple function to overcome it:
plot_line_color <- function(x, y, fact, lwd = 2, ...)
{
  # set up the plotting region without drawing anything
  plot(x, y, type = 'n')
  xy <- cbind(x, y)
  # draw one line per factor level, coloured by the level index
  invisible(
    lapply(seq_along(unique(fact)), function(j) {
      xy2 <- subset(xy, fact == j)
      lines(xy2[, 1], xy2[, 2], col = j, lwd = lwd, ...)
    })
  )
}
A simple simulation:
k <- 1:5
x <- seq(0,10,length.out = 100)
dsts <- lapply(1:length(k), function(i) cbind(x=x, distri=dchisq(x,k[i]),fact=i) )
dsts <- do.call(rbind,dsts)
plot_line_color(x=dsts[,1],y=dsts[,2],fact=dsts[,3])
I have an example matrix:
p <- matrix(c(0.5, 0.3, 0.3, -0.1, 0.6, 0.7, -0.2, -0.1), ncol = 4, byrow = T)
> p
[,1] [,2] [,3] [,4]
[1,] 0.5 0.3 0.3 -0.1
[2,] 0.6 0.7 -0.2 -0.1
with one or more negative elements in every row. The largest element of each row is on the diagonal.
I want to create a function which, row by row, subtracts the (absolute values of the) negative entries from the diagonal element and then sets those entries to zero, so that each row sum is 1 again.
I tried it myself with the apply function but had no luck until now.
Hope someone could help me.
Best Wishes
shearer
Here's one way:
negs <- p < 0                                       # locate the negative entries
diag(p) <- diag(p) + rowSums(replace(p, !negs, 0))  # add each row's negative total to its diagonal
p[negs] <- 0                                        # then zero out the negative entries
p
# [,1] [,2] [,3] [,4]
# [1,] 0.4 0.3 0.3 0
# [2,] 0.6 0.4 0.0 0
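Since the question mentions apply, here is an equivalent row-wise sketch, starting again from the original p (fix_row is a made-up helper, not part of the question):
fix_row <- function(r, i) {
  r[i] <- r[i] + sum(r[r < 0])  # fold the row's negative total into its diagonal entry
  r[r < 0] <- 0                 # then zero out the negative entries
  r
}
p2 <- t(sapply(seq_len(nrow(p)), function(i) fix_row(p[i, ], i)))  # t(): sapply returns rows as columns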
I'm trying to model a system of continuous-time Markov chains where I have different rates in different time intervals.
I build a rate matrix for each time period like this
make.rate.matrix <- function(c1, c2, m12, m21) {
matrix(
c(# State 1: lineages in different populations
-(m12+m21), m21, m12, 0,
# State 2: both lineages in population 1
2*m12, -(2*m12+c1), 0, c1,
# State 3: both lineages in population 2
2*m21, 0, -(2*m21+c2), c2,
# State 4: coalesced (catches both populations; absorbing)
0, 0, 0, 0),
byrow = TRUE, nrow=4)
}
(if you are interested, it is modelling the coalescence density in a two-deme system with migration)
The rates, the cs and ms, differ between time periods, so I want to build a rate matrix for each time period and then a transition probability matrix for each period.
With two periods I can specify the rates like this
rates <- data.frame(c1 = c(1,2), c2 = c(2,1), m12 = c(0.2, 0.3), m21 = c(0.4, 0.2))
and I want to use the first rates from time 0 to t and the second set of rates from time t to s, say.
So I want to have a table of rate matrices for the first and second period, and probability transition matrices for moving from state a to b through the first and second period.
mlply(rates, make.rate.matrix)
gives me a list of the two rate matrices, and if I want a table where I can easily look up the rate matrices, I can do something like
> xx <- array(unlist(mlply(rates, make.rate.matrix)), dim=c(4,4,2))
> xx[,,1]
[,1] [,2] [,3] [,4]
[1,] -0.6 0.4 0.2 0
[2,] 0.4 -1.4 0.0 1
[3,] 0.8 0.0 -2.8 2
[4,] 0.0 0.0 0.0 0
> xx[,,2]
[,1] [,2] [,3] [,4]
[1,] -0.5 0.2 0.3 0
[2,] 0.6 -2.6 0.0 2
[3,] 0.4 0.0 -1.4 1
[4,] 0.0 0.0 0.0 0
I can then get the probability transition matrices like
> library(Matrix)
> t <- 1; s <- 2
> P1 <- expm(xx[,,1] * t)
> P2 <- expm(xx[,,2] * (s - t))
but I somehow cannot figure out how to get a table of these like I can get for the rate matrices.
I feel that I should be able to get there with aaply, but I am stumped as to how to get there.
How do I get a table P, where P[,,1] is P1 and P[,,2] is P2?
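For what it's worth, one possible sketch, reusing the xx array and the t and s defined above, is simply to exponentiate each slice over its period length in a loop:
library(Matrix)
periods <- c(t, s - t)                   # lengths of the two time intervals
P <- array(0, dim = c(4, 4, 2))
for (i in 1:2) P[, , i] <- as.matrix(expm(xx[, , i] * periods[i]))
# P[,,1] is then P1 and P[,,2] is P2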
I found that integer and double values behave differently in a matrix, with a wrong answer returned only for the double data type.
#Test
m <- matrix(1:12,4,3)
which(!m[1,] %in% 1:5)
[1] 3
However, when I changed the values in double/numeric,
m <- matrix(c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6), 4,3)
m
[,1] [,2] [,3]
[1,] 0.1 0.5 0.3
[2,] 0.2 0.6 0.4
[3,] 0.3 0.1 0.5
[4,] 0.4 0.2 0.6
which(!m[1,] %in% 0.10:0.35)
[1] 2 3
Only 2 should be in the answer, because elements 1 and 3 are within the range 0.10 to 0.35. Why is the computation different for integer and numeric? Thanks!
It's because you have a flawed understanding of what the : operator does. : does not indicate a range; it is a shortcut for generating sequences of discrete values, at steps of 1.
Compare:
> 1:5
[1] 1 2 3 4 5
> 0.1:0.35
[1] 0.1
So your first bit of code tests whether a value is %in% the sequence of integers 1 to 5, but the second bit of code tests whether your data is equal to 0.1.
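If an actual fine-grained sequence were wanted, seq() makes the step size explicit:
seq(0.10, 0.35, by = 0.05)
[1] 0.10 0.15 0.20 0.25 0.30 0.35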
To get the result you are after, you need to write the following:
which(!(m[1, ] >= 0.1 & m[1, ] <= 0.35))
[1] 2
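Equivalently, assuming the dplyr package is available, its between() helper expresses the same inclusive range check more readably:
library(dplyr)
which(!between(m[1, ], 0.1, 0.35))
[1] 2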