Error in isoMDS(d): zero or negative distance between objects - r

I'm trying to do a nonmetric MDS (R version 3.3.3) using the isoMDS function in the MASS package and I get this error:
Error in isoMDS(d): zero or negative distance between objects 1 and 2
Here's an example of what I'm doing:
# LOAD LIBRARY
library(MASS)
# CREATE FAKE DATA
a <- c(1, 1, 1, 1)
b <- c(2, 2, 2, 2)
c <- c(3, 3, 4, 5)
d <- c(4, 4, 7, 9)
x <- data.frame(a, b, c, d)
x
a b c d
1 1 2 3 4
2 1 2 3 4
3 1 2 4 7
4 1 2 5 9
# EUCLIDEAN DISTANCE BETWEEN ROWS 1, 2, 3 and 4
d <- dist(x)
d
1 2 3
2 0.000000
3 3.162278 3.162278
4 5.385165 5.385165 2.236068
# NMDS
fit <- isoMDS(d)
Error in isoMDS(d) : zero or negative distance between objects 1 and 2
I don't know if there's a way of getting around this issue or if I'm doing something wrong. I understand that objects 1 and 2 are identical, which is probably why the distance is negative or equal to zero. I found out that my question is a "FAQ", but one of the only answers I found is this:
Short answer: you cannot compare distances including NAs, so there is no
way to find a monotone mapping of distances. If the data really are identical for two rows, you can easily drop one of
them whilst doing MDS, and then assign the position found for one to the
other.
So, my next questions are: how do you drop rows whilst doing MDS, and is there any other way to perform a NMDS?
Any help would be greatly appreciated!

The dist function computes the distances between the rows of a data matrix.
Your a, b, c, and d vectors are the columns of the x matrix, not the rows.
A simple solution is to transpose x:
library(MASS)
a <- c(1, 1, 1, 1)
b <- c(2, 2, 2, 2)
c <- c(3, 3, 4, 5)
d <- c(4, 4, 7, 9)
x <- data.frame(a, b, c, d)
# Calculate distance between the columns
d <- dist(t(x))
# NMDS
fit <- isoMDS(d)
# initial value 0.000000
# final value 0.000000
# converged
fit
# $points
# [,1] [,2]
# a -4.594429 0.4509513
# b -2.770312 -0.3638885
# c 1.098884 -0.3114594
# d 6.265857 0.2243966
#
# $stress
# [1] 7.976932e-15
I hope it can help you.
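To see what transposing changes, here is a minimal sketch in base R (the names d_rows, d_cols and zero_pairs are only illustrative). It compares the two distance objects and locates the zero-distance pair that triggers the isoMDS error. Note that after transposing, the points being scaled are the four variables rather than the four observations, so this only applies if that is what you want to ordinate.

```r
x <- data.frame(a = c(1, 1, 1, 1),
                b = c(2, 2, 2, 2),
                c = c(3, 3, 4, 5),
                d = c(4, 4, 7, 9))

d_rows <- dist(x)      # distances between the 4 rows (observations)
d_cols <- dist(t(x))   # distances between the 4 columns (variables)

attr(d_cols, "Labels") # the objects are now the variables a, b, c, d

# Find pairs of rows whose distance is exactly zero
m <- as.matrix(d_rows)
zero_pairs <- which(m == 0 & upper.tri(m), arr.ind = TRUE)
zero_pairs             # one pair: row 1 and row 2
```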

As you noted, you have identical rows.
You can omit identical rows when you first create the distance matrix
d <- dist(x[-1,])
Then continue as normal
fit <- isoMDS(d)
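If you do not know in advance which rows are duplicated, something along these lines may work: run the MDS on the unique rows only, then copy each position back to every original row. This is a sketch, not a tested recipe. The variables key, keep and pts are made up for illustration, and building a key with paste assumes the values print identically when they are equal.

```r
library(MASS)

x <- data.frame(a = c(1, 1, 1, 1),
                b = c(2, 2, 2, 2),
                c = c(3, 3, 4, 5),
                d = c(4, 4, 7, 9))

# One string key per row; identical rows share the same key
key  <- do.call(paste, c(x, sep = "\r"))
keep <- !duplicated(key)

# NMDS on the unique rows only (rows 1, 3 and 4 here)
fit <- isoMDS(dist(x[keep, ]), trace = FALSE)

# Give every original row the position found for its representative
pts <- fit$points[match(key, key[keep]), , drop = FALSE]
rownames(pts) <- rownames(x)
pts
```

Rows 1 and 2 end up with the same coordinates, which is what the quoted FAQ answer suggests.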

Alternatively, you could try the vegan::metaMDS function:
library(vegan)
#> This is vegan 2.5-3
x <- data.frame(a = c(1, 1, 1, 1),
b = c(2, 2, 2, 2),
c = c(3, 3, 4, 5),
d = c(4, 4, 7, 9))
# The warnings are expected for such a small dataset
fit <- vegan::metaMDS(comm = dist(x))
#> ... Procrustes: rmse 0.09543314 max resid 0.108719
#> *** No convergence -- monoMDS stopping criteria:
#> 17: stress < smin
#> 3: scale factor of the gradient < sfgrmin
#> Warning in vegan::metaMDS(comm = dist(x)): stress is (nearly) zero: you may
#> have insufficient data
ordiplot(fit, type = "text")
Rows 1 and 2 (the identical observations) get the same coordinates.
Similarly, using the smacof::mds function:
library(smacof)
fit2 <- smacof::mds(delta = dist(x), type = "ordinal")
fit2$conf
#> D1 D2
#> 1 0.5742535 0.007220978 # 1 & 2 get the same coordinates
#> 2 0.5742535 0.007220978
#> 3 -0.2749314 -0.034928060
#> 4 -0.8735757 0.020486105


Writing a summation formula using variables from multiple observations

I am trying to create a new variable for each observation using the following formula:
Index = ∑(BAj / DISTANCEij)
where:
j = focal observation; i= other observation
Basically, I'm taking the focal individual (i), finding the Euclidean distance between it and another point, and dividing the other point's BA by that distance. I do that for all the other points, sum the results, and then repeat all of this for each point.
Here is some sample data:
ID <- 1:4
BA <- c(3, 5, 6, 9)
x <- c(0, 2, 3, 7)
y <- c(1, 3, 4, 9)
df <- data.frame(ID, BA, x, y)
print(df)
ID BA x y
1 1 3 0 1
2 2 5 2 3
3 3 6 3 4
4 4 9 7 9
Currently, I've extracted vectors and written a function that calculates one term of the formula, shown here:
vec1 <- df[1, ]
vec2 <- df[2, ]
dist <- function(vec1, vec2) vec1$BA/sqrt((vec2$x - vec1$x)^2 +
(vec2$y - vec1$y)^2)
My question is how do I repeat this with the x and y values for vec2 changing for each new other point with vec1 remaining the same and then sum them all together?
We may loop over the row sequence, extract the data and apply the dist function
library(dplyr)
library(purrr)
df %>%
mutate(dist_out = map_dbl(row_number(), ~ {
othr <- cur_data()[-.x,]
cur <- cur_data()[.x, ]
sum(dist(cur, othr))
}))
-output
ID BA x y dist_out
1 1 3 0 1 2.049983
2 2 5 2 3 5.943485
3 3 6 3 4 6.593897
4 4 9 7 9 3.404545
Here are two base R ways.
1. for loop
ID <- 1:4
BA <- c(3, 5, 6, 9)
x <- c(0, 2, 3, 7)
y <- c(1, 3, 4, 9)
df <- data.frame(ID, BA, x, y)
n <- nrow(df)
d <- dist(df[c("x", "y")], upper = TRUE)
d <- as.matrix(d)
Index <- numeric(n)
for(j in seq_len(n)) {
d_j <- d[-j, j, drop = TRUE]
Index[j] <- sum(df$BA[j]/d_j)
}
Index
#> [1] 2.049983 5.943485 6.593897 3.404545
Created on 2022-08-18 by the reprex package (v2.0.1)
2. sapply loop
Index <- sapply(seq_len(n), \(j) sum(df$BA[j]/d[-j, j, drop = TRUE]))
Index
#> [1] 2.049983 5.943485 6.593897 3.404545
Created on 2022-08-18 by the reprex package (v2.0.1)
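The loop can probably also be avoided altogether. Here is a sketch that works on the full distance matrix at once. It follows the answers above in dividing the focal point's BA by each distance. Index_other is a hypothetical variant for the case where the other point's BA is meant, as the wording of the question suggests.

```r
df <- data.frame(ID = 1:4,
                 BA = c(3, 5, 6, 9),
                 x  = c(0, 2, 3, 7),
                 y  = c(1, 3, 4, 9))

# Full symmetric matrix of pairwise Euclidean distances
# (stats::dist is spelled out in case a user-defined dist() masks it)
d <- as.matrix(stats::dist(df[c("x", "y")]))

# 1/Inf is 0, so a point contributes nothing to its own sum
diag(d) <- Inf

# Focal BA divided by each distance, summed per point (as in the loops above)
Index <- df$BA * rowSums(1 / d)
Index
#> 2.049983 5.943485 6.593897 3.404545

# Variant: the OTHER point's BA divided by the distance, summed per focal point
Index_other <- as.vector((1 / d) %*% df$BA)
```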

Back tracing parents/paths of two-column data of a tree

I have tree data serialized like the following:
Relationship: P to C is "one-to-many", and C to P is "one-to-one". So column P may have duplicate values, but column C has unique values.
P, C
1, 2
1, 3
3, 4
2, 5
4, 6
# in data.frame
df <- data.frame(P=c(1,1,3,2,4), C=c(2,3,4,5,6))
1. How do I efficiently implement a function func so that:
func(df, val) returns a vector of full path to root (1 in this case).
For example:
func(df, 3) returns c(1,3)
func(df, 5) returns c(1,2,5)
func(df, 6) returns c(1,3,4,6)
2. Alternatively, quickly transforming df to a lookup table like this also works for me:
C, Paths
2, c(1,2)
3, c(1,3)
4, c(1,3,4)
5, c(1,2,5)
6, c(1,3,4,6)
Here is a solution using igraph
library(igraph)
g <- graph_from_data_frame(df)
df <- within(df,
Path <- sapply(match(as.character(C),names(V(g))),
function(k) toString(names(unlist(all_simple_paths(g,1,k))))))
such that
> df
P C Path
1 1 2 1, 2
2 1 3 1, 3
3 3 4 1, 3, 4
4 2 5 1, 2, 5
5 4 6 1, 3, 4, 6
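If you would rather not depend on igraph, a plain base R sketch is to walk up the parent column until no parent is found. The function name path_to_root is made up here. It assumes every child has exactly one parent and that the data contains no cycles (otherwise the loop would not terminate).

```r
df <- data.frame(P = c(1, 1, 3, 2, 4), C = c(2, 3, 4, 5, 6))

path_to_root <- function(df, val) {
  path <- val
  repeat {
    i <- match(path[1], df$C)   # row where the current node appears as a child
    if (is.na(i)) break         # no parent found: we have reached the root
    path <- c(df$P[i], path)    # prepend the parent and continue upwards
  }
  path
}

path_to_root(df, 3)  # 1 3
path_to_root(df, 5)  # 1 2 5
path_to_root(df, 6)  # 1 3 4 6

# Lookup table for all children
paths <- setNames(lapply(df$C, path_to_root, df = df), df$C)
```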

R: Grouping data within certain range

I have a data frame with two columns, let's call them X and Y. Here's an example of it:
df <- data.frame(X = LETTERS[1:8],
Y = c(14, 12, 12, 11, 9, 6, 4, 1),
stringsAsFactors = FALSE)
which produces this:
X Y
A 14
B 12
C 12
D 11
E 9
F 6
G 4
H 1
Note that the data frame will always be ordered in a descending order based on Y. I want to group together cases where the Y values lie within a certain range, while updating the X column to reflect the grouping too. For example, if the value is 2, I would like the final output to be:
X new_Y
A 14.00000
B C D 11.66667
E 9.00000
F G 5.00000
H 1.00000
Let me explain how I got that. From the starting df data frame, the closest values were B and C. Joining them would result in:
X new_Y
A 14
B C 12
D 11
E 9
F 6
G 4
H 1
The new_Y value for cases B and C is the average of the original values for B and C i.e. 12. From this second data frame, B C are within 2 from D so they are the next to be grouped together:
X new_Y
A 14.00000
B C D 11.66667
E 9.00000
F 6.00000
G 4.00000
H 1.00000
Note that the Y value for B C D is 11.67 because the original values of B, C and D were 12, 12 and 11 respectively and their average is 11.667. I wouldn't want the code to return the average Y from the previous iteration (which in this case would be 11.5).
Finally, F and G can also be grouped together, producing the final output stated above.
I'm not sure of the code needed to achieve this. My only thoughts were to calculate the distance from the previous and following element, look for the minimum and check whether it exceeds the threshold value (of 2 in the example above). Based on where that minimum appears, join the X column while averaging the Y values from the original table. Repeat this until the minimum becomes larger than the threshold.
But I'm not sure how to write the necessary code to achieve this or whether there's a more efficient solution to the algorithm I'm suggesting above. Any help will be much appreciated.
P.S I forgot to mention that if the distance between the previous and the following Y value is the same, then the grouping should be done towards the larger Y value. So
X Y
A 10
B 8
C 6
would be returned as
X new_Y
A B 9
C 6
Thanks in advance for your patience. My apologies if I didn't explain this very well.
This sounds like hierarchical agglomerative clustering.
To get the groups, use dist, hclust and cutree.
Note that centroid clustering with hclust expects the distances as the square of the Euclidean distance.
df <- data.frame(X = LETTERS[1:8],
Y = c(14, 12, 12, 11, 9, 6, 4, 1),
stringsAsFactors = FALSE)
dCutoff <- 2
d2 <- dist(df$Y)^2
hc <- hclust(d2, method = "centroid")
group_id <- cutree(hc, h = dCutoff^2)
group_id
#> [1] 1 2 2 2 3 4 4 5
To munge the original table, we can use dplyr.
library('dplyr')
df %>%
group_by(group_id = group_id) %>%
summarise(
X = paste(X, collapse = ' '),
Y = mean(Y))
#> # A tibble: 5 x 3
#> group_id X Y
#> <int> <chr> <dbl>
#> 1 1 A 14.00000
#> 2 2 B C D 11.66667
#> 3 3 E 9.00000
#> 4 4 F G 5.00000
#> 5 5 H 1.00000
This gives the average of the previous iteration, though. In any case, I hope it helps.
library(data.table)
df <- data.table(X = LETTERS[1:8],
Y = c(14, 12, 12, 11, 9, 6, 4, 1),
stringsAsFactors = FALSE)
differences <- c(diff(df$Y),NA) # NA for the last element
df$difference <- abs(differences) # absolute differences between consecutive elements (valid because Y is sorted)
minimum <- min(df$difference[1:(length(df$difference)-1)]) # get the minimum
while (minimum <= 2){ # "within 2" includes a gap of exactly 2 (e.g. F and G)
index <- which(df$difference==minimum)[1] # first position of the minimum; ties resolve towards the larger Y
check = FALSE
# because the last row cannot have a number since there is not an element after that
# we need to see if this element has the minimum difference with its previous
# if it does not have the minimum difference then we exclude it and paste it later
if(df[nrow(df)-1,difference]!=minimum){
last_row <- df[nrow(df)]
df <- df[-nrow(df)]
check = TRUE
}
tmp <- df[(index:(index+1))]
df <- df[-(index:(index+1))]
to_bind <- data.table(X = paste0(tmp$X, collapse = " "))
to_bind$Y <- mean(tmp$Y)
df <- rbind(df[,.(X,Y)],to_bind)
if(check){
df <- rbind(df,last_row[,.(X,Y)])
}
setorder(df,-Y)
differences <- c(diff(df$Y),NA) # NA for the last element
df$difference <- abs(differences) # absolute differences between consecutive elements (valid because Y is sorted)
minimum <- min(df$difference[1:(length(df$difference)-1)]) # get the minimum
}
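For comparison, here is a base R sketch of the procedure as the question describes it: repeatedly merge the two adjacent groups with the smallest gap, always averaging the original Y values. The function name group_within and its arguments are hypothetical. It assumes Y is sorted in descending order and treats "within the threshold" as less than or equal to it.

```r
group_within <- function(X, Y, threshold) {
  groups <- as.list(seq_along(Y))          # each row starts as its own group
  repeat {
    if (length(groups) < 2) break
    means <- vapply(groups, function(g) mean(Y[g]), numeric(1))
    gaps  <- abs(diff(means))              # gaps between adjacent group means
    i <- which.min(gaps)                   # first minimum: ties go to the larger Y
    if (gaps[i] > threshold) break
    groups[[i]] <- c(groups[[i]], groups[[i + 1]])  # merge group i+1 into group i
    groups[[i + 1]] <- NULL
  }
  data.frame(
    X     = vapply(groups, function(g) paste(X[g], collapse = " "), character(1)),
    new_Y = vapply(groups, function(g) mean(Y[g]), numeric(1))
  )
}

res <- group_within(LETTERS[1:8], c(14, 12, 12, 11, 9, 6, 4, 1), threshold = 2)
res$X
# "A" "B C D" "E" "F G" "H"
```

Because the means are always recomputed from the original values, B C D comes out as 11.66667 rather than 11.5.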

Sorting by successive vectors in R [duplicate]

I have a vector x, that I would like to sort based on the order of values in vector y. The two vectors are not of the same length.
x <- c(2, 2, 3, 4, 1, 4, 4, 3, 3)
y <- c(4, 2, 1, 3)
The expected result would be:
[1] 4 4 4 2 2 1 3 3 3
What about this one:
x[order(match(x,y))]
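To unpack that one-liner a little: match gives, for each element of x, its position in y, and order then sorts by those positions. The sketch below uses the illustrative names pos, sorted and x2, and also shows what happens to a value that does not occur in y.

```r
x <- c(2, 2, 3, 4, 1, 4, 4, 3, 3)
y <- c(4, 2, 1, 3)

pos <- match(x, y)        # position of each x value within y
pos
# 2 2 4 1 3 1 1 4 4

sorted <- x[order(pos)]   # reorder x by those positions
sorted
# 4 4 4 2 2 1 3 3 3

# A value missing from y gets position NA, and order() puts NA last by default
x2 <- c(9, x)
x2[order(match(x2, y))]
# 4 4 4 2 2 1 3 3 3 9
```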
You could convert x into an ordered factor:
x.factor <- factor(x, levels = y, ordered=TRUE)
sort(x)
sort(x.factor)
Obviously, changing your numbers into factors can radically change the way code downstream reacts to x. But since you didn't give us any context about what happens next, I thought I would suggest this as an option.
How about?:
rep(y,table(x)[as.character(y)])
(Ian's is probably still better)
In case you need to get order on "y" no matter if it's numbers or characters:
x[order(ordered(x, levels = y))]
4 4 4 2 2 1 3 3 3
By steps:
a <- ordered(x, levels = y) # Create ordered factor from "x" upon order in "y".
[1] 2 2 3 4 1 4 4 3 3
Levels: 4 < 2 < 1 < 3
b <- order(a) # Define "x" order that match to order in "y".
[1] 4 6 7 1 2 5 3 8 9
x[b] # Reorder "x" according to order in "y".
[1] 4 4 4 2 2 1 3 3 3
[Edit: Clearly Ian has the right approach, but I will leave this in for posterity.]
You can do this without loops by indexing on your y vector. Add an incrementing numeric value to y and merge them:
y <- data.frame(index=1:length(y), x=y)
x <- data.frame(x=x)
x <- merge(x,y)
x <- x[order(x$index),"x"]
x
[1] 4 4 4 2 2 1 3 3 3
x <- c(2, 2, 3, 4, 1, 4, 4, 3, 3)
y <- c(4, 2, 1, 3)
z <- c() # start with an empty result
for(i in y) { z <- c(z, rep(i, sum(x==i))) }
The result in z: 4 4 4 2 2 1 3 3 3
The important steps:
for(i in y) -- Loops over the elements of interest.
z <- c(z, ...) -- Concatenates each subexpression in turn
rep(i, sum(x==i)) -- Repeats i (the current element of interest) sum(x==i) times (the number of times we found i in x).
You can also use sqldf and do it with a SQL join, like the following:
library(sqldf)
x <- data.frame(x = c(2, 2, 3, 4, 1, 4, 4, 3, 3))
y <- data.frame(y = c(4, 2, 1, 3))
result <- sqldf("SELECT x.x FROM y JOIN x on y.y = x.x")
ordered_x <- result[[1]]

Including squared predictors in model matrix

I have the following code
x <- c(1, 2, 3)
y <- c(2, 3, 4)
z <- c(3, 4, 5)
df <- data.frame(x, y, z)
model.matrix(x ~ .^4, df)
This gives me a model matrix with predictors $y, z$, and $y:z$. However, I also want y^2 and z^2, and want to use a solution that uses "$.$", since I have lots of other predictors beyond $y$ and $z$. What's the best way to approach this?
Try this:
> x <- c(1, 2, 3)
> y <- c(2, 3, 4)
> z <- c(3, 4, 5)
> df <- data.frame(x, y, z)
>
> #Assuming that your 1st column is the response variable, then I excluded it to have
> #just the independent variables as a new data.frame called df.2
> df.2=df[,-1]
> model.matrix(x ~ .^4+I(df.2^2), df)
(Intercept) y z I(df.2^2)y I(df.2^2)z y:z
1 1 2 3 4 9 6
2 1 3 4 9 16 12
3 1 4 5 16 25 20
attr(,"assign")
[1] 0 1 2 3 3 4
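Another option, if you prefer one named column per squared term, might be to build the formula as a string from the predictor names. This is only a sketch: preds, f and mm are illustrative names, and it assumes x is the response and every other column is a numeric predictor.

```r
x <- c(1, 2, 3)
y <- c(2, 3, 4)
z <- c(3, 4, 5)
df <- data.frame(x, y, z)

# All columns except the response
preds <- setdiff(names(df), "x")

# Builds: x ~ .^4 + I(y^2) + I(z^2)
f <- as.formula(paste("x ~ .^4 +", paste0("I(", preds, "^2)", collapse = " + ")))

mm <- model.matrix(f, df)
colnames(mm)
```

The squared terms then appear as separate columns named I(y^2) and I(z^2), alongside y, z and y:z.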
