Find nearest 3 points to a given point in a dataframe - r

I've got a dataframe with latitude and longitude, which looks like this:
x y set
61 -112
63 -113
61 -113
62 -111 point
61 -111
64 -120
I want to find the three closest points to the point that is marked as point in column set. Then, for these three closest points, I want to amend the column set to say closest. Like this:
x y set
61 -112 closest
63 -113
61 -113 closest
62 -111 point
61 -111 closest
64 -120
How can I do this?

dists <- geosphere::distHaversine(dat[dat$set=="point",c("y","x")], dat[,c("y","x")])
dists
# [1] 123339.4 151513.9 153862.4 0.0 111319.5 505814.4
dat$set[dat$set != "point" & rank(dists) < 5] <- "closest"
dat
# x y set
# 1 61 -112 closest
# 2 63 -113 closest
# 3 61 -113
# 4 62 -111 point
# 5 61 -111 closest
# 6 64 -120
The reason we use < 5 is that the own-distance (point to point) will be the closest (0), so we need ranks 2-4. This assumes there is one "point"; if there are more, you'll likely want outer (to produce a matrix of distances) and look at each row before populating $set.
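For the case with more than one "point" row mentioned above, here is a minimal base R sketch. It assumes, purely for illustration, that row 6 is also marked point. The haversine helper is hand-written (hypothetical, not part of geosphere) and uses 6378137 m as the radius, which is the default of geosphere::distHaversine, so no package is required.

```r
# Hand-written haversine distance in metres (vectorised over all arguments)
haversine <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

dat <- data.frame(
  x   = c(61, 63, 61, 62, 61, 64),               # latitude
  y   = c(-112, -113, -113, -111, -111, -120),   # longitude
  set = c("", "", "", "point", "", "point"),     # second "point" is an assumption
  stringsAsFactors = FALSE
)

ref <- which(dat$set == "point")

# Distance matrix: one row per reference point, one column per data row
d <- outer(ref, seq_len(nrow(dat)),
           function(i, j) haversine(dat$y[i], dat$x[i], dat$y[j], dat$x[j]))
d[, ref] <- Inf   # a reference point is never a candidate neighbour

# Column k holds the row indices of the 3 nearest rows to ref[k]
nearest <- apply(d, 1, function(row) order(row)[1:3])
nearest[, 1]      # for the point in row 4: 5 1 2

dat$set[unique(as.vector(nearest))] <- "closest"
```

Setting the reference columns to Inf replaces the rank(dists) < 5 trick, so order(row)[1:3] can be used directly for every reference row.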
I'm inferring latitude and longitude from the sp tag, so chose the Haversine distance calculation since it's fast, and the appearance of coarse coordinates does not suggest the requirement for sub-millimeter accuracy (i.e., Vincenty Ellipsoid formula). There are other distance calculations if needed.

Here is another approach with geosphere (build a distance matrix with distm), followed by the terra::nearby method, which works for both long/lat and planar coordinates.
m <- matrix(c(61, -112, 63, -113, 61, -113, 62, -111, 61, -111, 64, -120), ncol=2, byrow=TRUE)
# note that the order should be long/lat !!!
m <- m[, 2:1]
d <- geosphere::distm(m)
diag(d) <- NA
i <- order(d[4,])[1:3]
i
#[1] 5 1 2
m[i,]
# [,1] [,2]
#[1,] -111 61
#[2,] -112 61
#[3,] -113 63
Now with terra. The below gets the nearest 3 neighbors for all points.
library(terra)
v <- vect(m, crs="+proj=lonlat")
nearby(v, k=3)
# id k1 k2 k3
#1 1 3 5 4
#2 2 4 3 1
#3 3 1 5 4
#4 4 5 1 2
#5 5 1 3 4
#6 6 2 3 4
With terra version 1.3.15 (currently the development version) you can also do
nearby(v[4,], v, k=4)
# id k1 k2 k3 k4
#[1,] 1 4 5 1 2
k=4 is used because the first neighbor returned is the point itself.
To get the development version, do
install.packages('terra', repos='https://rspatial.r-universe.dev')

Related

How can I extract a part of a vector to another vector (including positions)

I have a vector with different values (positive and negative), and I want to select only the 10 lowest odd values and the 10 lowest even values. Help me, please!
This is a way to do it using base R.
# vector with odd and even numbers
x <- sample(-100:100, 30)
The modulus operator (%%) in R gets the job done. You can use it this way
c(
# Extract the 10 lowest even numbers
head(sort(x[x %% 2 == 0]), 10),
# Extract the 10 lowest odd numbers
head(sort(x[x %% 2 == 1]), 10)
)
Given vector v as your input vector, you can obtain the desired output (including positions) via the following code
names(v) <- seq_along(v)
# lowest 10 odd numbers
low_odd <- sort(v[v%%2==1])[1:10]
# positions of those odd numbers in v
low_odd_pos <- as.numeric(names(low_odd))
# lowest 10 even numbers
low_even <- sort(v[v%%2==0])[1:10]
# positions of those even numbers in v
low_even_pos <- as.numeric(names(low_even))
Example
set.seed(1)
v <- sample(-50:50)
then
> low_odd
43 101 39 95 85 72 7 73 45 29
-49 -47 -45 -43 -41 -39 -37 -35 -33 -31
> low_odd_pos
[1] 43 101 39 95 85 72 7 73 45 29
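As a sketch of an alternative that avoids storing positions in names(v), the positions can be taken from which() and order() directly. The variable name odd_pos is introduced here for illustration only. Note that in R, %% with a positive divisor returns a non-negative remainder, so negative odd numbers also satisfy v %% 2 == 1.

```r
set.seed(1)
v <- sample(-50:50)

# Positions of all odd values in v
odd_pos <- which(v %% 2 == 1)

# Reorder those positions by the value they point at, keep the first 10
low_odd_pos <- odd_pos[order(v[odd_pos])][1:10]
low_odd     <- v[low_odd_pos]

low_odd       # the 10 lowest odd values, in increasing order
low_odd_pos   # where each of them sits in v
```

The same pattern with v %% 2 == 0 gives the even values and their positions.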

Problem with running c-means clustering on my data in R

How can I fix this problem for the following data?
> x=data.frame(v1=c("a" ,"b" ,"c" ,"d" ,"e"),
+ v2=c(97 ,90 ,93 ,97 ,90),
+ v3=c( 85 ,91 ,87 ,91 ,93))
> library(e1071)
> f <- cmeans(x, 2)
Error in cmeans(x, 2) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning messages:
1: In cmeans(x, 2) : NAs introduced by coercion
2: In cmeans(x, 2) : NAs introduced by coercion
> f
I want to apply c-means to my data as illustrated in the code above. It contains three vectors: v1, v2, v3. I want to apply c-means with the observations labelled by vector v1.
If we look at the documentation of ?cmeans,
x - The data matrix where columns correspond to variables and rows to observations.
So, we can convert the data.frame to matrix after removing the character column (1st column)
x1 <- as.matrix(x[-1])
row.names(x1) <- x[,1]
cmeans(x1, 2)
#Fuzzy c-means clustering with 2 clusters
#Cluster centers:
# v2 v3
#1 90.30090 91.85191
#2 95.75436 87.22535
#Memberships:
# 1 2
#a 0.06614213 0.93385787
#b 0.98305641 0.01694359
#c 0.19855988 0.80144012
#d 0.25730888 0.74269112
#e 0.97924422 0.02075578
#Closest hard clustering:
#a b c d e
#2 1 2 2 1
#Available components:
#[1] "centers" "size" "cluster" "membership" "iter" "withinerror" "call"
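To see why the character column matters, the following base R sketch shows what happens when the whole data frame is converted to a matrix. This is an illustration of the coercion only; it does not call cmeans, and the claim that cmeans performs a similar conversion internally is an assumption based on the "NAs introduced by coercion" warnings in the question.

```r
x <- data.frame(v1 = c("a", "b", "c", "d", "e"),
                v2 = c(97, 90, 93, 97, 90),
                v3 = c(85, 91, 87, 91, 93),
                stringsAsFactors = FALSE)

# One character column turns the whole matrix into character
m_all <- as.matrix(x)
storage.mode(m_all)                          # "character"

# Forcing the labels to numeric yields NA, hence the coercion warnings
suppressWarnings(as.numeric(m_all[, "v1"]))

# Keep only the numeric columns and carry the labels as row names
num_cols <- vapply(x, is.numeric, logical(1))
x1 <- as.matrix(x[, num_cols, drop = FALSE])
rownames(x1) <- x$v1
storage.mode(x1)                             # "double"
```

Selecting columns with is.numeric rather than by position (x[-1]) keeps working if the label column is not the first one.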
The k-means family of partitional clustering algorithms is built on computing means, so by its nature it accepts only numeric values. You are getting an error because the data frame consists of both numeric and categorical values, which cmeans() cannot handle. Also, there is no need to convert the data frame to a matrix, because that is not the actual problem.
Alternative approach
Discretize the character variable to assign it numbers and then apply clustering. This way there is no need to drop any variable.
# create empty data frame
df<- setNames(data.frame(matrix(ncol = 5, nrow = 5)), c("a" ,"b" ,"c" ,"d" ,"e"))
# fill values
df$a<- c("aaaa" ,"bbbb" ,"cccc" ,"dddd" ,"eeee")
df$b<- c(97 ,90 ,93 ,97 ,90)
df$c<- c(97 ,90 ,93 ,97 ,90)
df$d<- c( 85 ,91 ,87 ,91 ,93)
df$e<- c( 85 ,91 ,87 ,91 ,93)
# show the dataframe
df
a b c d e
1 aaaa 97 97 85 85
2 bbbb 90 90 91 91
3 cccc 93 93 87 87
4 dddd 97 97 91 91
5 eeee 90 90 93 93
# Discretize the character variable
df$a <- as.numeric( factor(df$a) ) -1
df
a b c d e
1 0 97 97 85 85
2 1 90 90 91 91
3 2 93 93 87 87
4 3 97 97 91 91
5 4 90 90 93 93
# Apply clustering
library(e1071)
cmeans(df, 2)
Fuzzy c-means clustering with 2 clusters
Cluster centers:
a b c d e
1 1.406 95.72 95.72 87.18 87.18
2 2.510 90.36 90.36 91.85 91.85
Memberships:
1 2
[1,] 0.92728 0.07272
[2,] 0.04014 0.95986
[3,] 0.80061 0.19939
[4,] 0.72009 0.27991
[5,] 0.03544 0.96456
Closest hard clustering:
[1] 1 2 1 1 2
Available components:
[1] "centers" "size" "cluster" "membership" "iter"
[6] "withinerror" "call"

Efficient Way To Find Neighboring Coordinates In R

I am working with a matrix data set that has X-Y coordinates; the remaining columns hold logical values for different parameters. I want to find the neighboring coordinates of each X-Y pair for which at least one of the corresponding parameters is true, and then append them to a new matrix. Below is the sample matrix data.
Data_1
X Y P1 P2 P3 P4
-52 32 1 0 0 1
-50 34 0 0 0 0
-50 26 0 0 0 1
-52 31 0 1 1 1
To solve this, I am planning to use following algorithm:
Algorithm
# Find row wise sum
newCol <- rowSums(Data_1)
# Bind as first column with Data_1
newData <- cbind(newCol, Data_1)
# Not R code, pseudo code
if (newData[,1] != 0) {
store newData[,2] and newData[,3].
Data_2 <- find neighboring coordinates to newData[,2] and newData[,3].
}
finalData <- cbind(Data_1, Data_2)
Output
X Y P1 P2 P3 P4 N1.x N1.y N2.x N2.y N3.x N3.y N4.x N4.y N5.x N5.y N6.x N6.y N7.x N7.y N8.x N8.y
-52 32 1 0 0 1 <Neighboring Coordinates---->
-50 34 0 0 0 0 <NULL>
-50 26 0 0 0 1 <Neighboring Coordinates---->
-52 31 0 1 1 1 <Neighboring Coordinates---->
The problem with this approach is scalability: it will be slow once the matrix has millions of rows and columns.
The following image shows the neighbor coordinates for (x,y).
Please suggest a better approach if possible, thanks.
How about a data frame approach--does it need to be a matrix?
# Create one data frame with the starting points
points <- data.frame(x = c(-52, -50, -50, -52),
y = c( 32, 34, 26, 31))
# Create a second data frame with the desired combinations of distances
distances <- expand.grid(xd = 1:4,
yd = 1:4)
# Repeat the distances for each point (cartesian product/outer join)
neighbors <- merge(points, distances)
# Compute neighbor coordinates
neighbors$nx <- neighbors$x + neighbors$xd
neighbors$ny <- neighbors$y + neighbors$yd
# sort
neighbors <- neighbors[order(neighbors$x, neighbors$y), ]
# display
head(neighbors)
Result
x y xd yd nx ny
4 -52 31 1 1 -51 32
8 -52 31 2 1 -50 32
12 -52 31 3 1 -49 32
16 -52 31 4 1 -48 32
20 -52 31 1 2 -51 33
24 -52 31 2 2 -50 33
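The same cross-join idea can be adapted to the question's two requirements. The sketch below assumes that "neighboring" means the 8 cells at unit offsets around (x, y), which is a guess since the image is not available, and it keeps only rows where at least one of P1..P4 is non-zero. Unlike the question's rowSums(Data_1), the sum here covers the parameter columns only, so X and Y do not affect the filter.

```r
Data_1 <- data.frame(X  = c(-52, -50, -50, -52),
                     Y  = c( 32,  34,  26,  31),
                     P1 = c(1, 0, 0, 0),
                     P2 = c(0, 0, 0, 1),
                     P3 = c(0, 0, 0, 1),
                     P4 = c(1, 0, 1, 1))

# Rows with at least one parameter set
keep <- rowSums(Data_1[, c("P1", "P2", "P3", "P4")]) > 0

# The 8 unit offsets around a cell (assumed neighbourhood), excluding (0, 0)
offsets <- expand.grid(dx = -1:1, dy = -1:1)
offsets <- offsets[!(offsets$dx == 0 & offsets$dy == 0), ]

# merge() without common columns gives the Cartesian product
neighbors <- merge(Data_1[keep, c("X", "Y")], offsets)
neighbors$nx <- neighbors$X + neighbors$dx
neighbors$ny <- neighbors$Y + neighbors$dy

neighbors <- neighbors[order(neighbors$X, neighbors$Y), ]
head(neighbors)
```

This long format (one row per point and neighbor) is usually easier to scale than the wide N1.x ... N8.y layout, and it can be reshaped afterwards if the wide form is needed.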

R: applying a function on whole dataset to find points within a circle

I am having difficulty applying my function to a data frame in R. I have a data.frame with three columns: the ID of a point, its location on the x axis, and its location on the y axis. For a given point, I need to find the IDs of the points that lie in its neighborhood. I've written a function that checks whether a point lies within a circle centered on the observed point and returns its ID if true.
Here is my code:
point_id <- locationdata$point_id
x_loc <- locationdata$x_loc
y_loc <- locationdata$y_loc
locdata <- data.frame(point_id, x_loc, y_loc)
#radius set to 1km
incircle3 <- function(x_loc, y_loc, center_x, center_y, pointid, r = 1000000){
dx = (x_loc-center_x)
dy = (y_loc-center_y)
if (b <- dx^2 + dy^2 <= r^2){
print(pointid)} ##else {print('')}
}
Unfortunately I don't know how to apply this function to the whole data frame, so that once I enter the location of the observed point it returns the IDs of all points that lie in its neighborhood. Ideally I would like to find this relation for all points automatically, so that it returns the points that lie in the neighborhood of each point in the dataset. Previously I have been entering center_x and center_y manually.
Thank you very much in advance for your advice!
You can tackle this with R's dist function:
# set the random seed and create some dummy data
set.seed(101)
dummy <- data.frame(id=1:100, x=runif(100), y=runif(100))
> head(dummy)
id x y
1 1 0.37219838 0.12501937
2 2 0.04382482 0.02332669
3 3 0.70968402 0.39186128
4 4 0.65769040 0.85959857
5 5 0.24985572 0.71833452
6 6 0.30005483 0.33939503
Call the dist function which returns a dist object. The default distance metric is Euclidean which is what you have coded in your question.
dists <- dist(dummy[,2:3])
Loop over the distance matrix and return the indices for each id that are within some constant distance:
neighbors <- apply(as.matrix(dists), 1, function(x) which(x < 0.33))
> neighbors[[1]]
1 6 7 8 19 23 30 32 33 34 42 44 46 51 55 87 88 91 94 99
Here's a modification to handle ids that are not simply the row numbers:
set.seed(101)
dummy <- data.frame(id=sample(1:100, 100), x=runif(100), y=runif(100))
> head(dummy)
id x y
1 38 0.12501937 0.60567568
2 5 0.02332669 0.56259740
3 70 0.39186128 0.27685556
4 64 0.85959857 0.22614243
5 24 0.71833452 0.98355758
6 29 0.33939503 0.09838715
dists <- dist(dummy[,2:3])
neighbors <- apply(as.matrix(dists), 1, function(x) {
dummy$id[which(x < 0.33)]
})
names(neighbors) <- dummy$id
> neighbors[['38']]
[1] 38 5 55 80 63 76 17 71 47 11 88 13 41 21 36 31 73 61 99 59 39 89 94 12 18 3
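If a full distance matrix is too large, the function from the question can also be vectorised directly. This is a sketch with made-up data: the column names follow the question's locdata, while in_circle and the four sample points are hypothetical. The function returns the IDs instead of printing them, and lapply repeats it for every point as the center.

```r
# Return the IDs of all points within radius r of (center_x, center_y)
in_circle <- function(locdata, center_x, center_y, r) {
  d2 <- (locdata$x_loc - center_x)^2 + (locdata$y_loc - center_y)^2
  locdata$point_id[d2 <= r^2]
}

# Made-up example data
locdata <- data.frame(point_id = c(11, 12, 13, 14),
                      x_loc    = c(0, 3, 0, 10),
                      y_loc    = c(0, 4, 1, 10))

in_circle(locdata, 0, 0, r = 5)
# [1] 11 12 13

# Neighborhood of every point, using each point in turn as the center
neigh <- lapply(seq_len(nrow(locdata)), function(i) {
  in_circle(locdata, locdata$x_loc[i], locdata$y_loc[i], r = 5)
})
names(neigh) <- locdata$point_id
```

Each center is included in its own neighborhood, as in the dist-based answer above; drop it with setdiff() if that is not wanted.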

Retrieving adjacency values in a nng igraph object in R

edited to improve the quality of the question as a result of the (wholly appropriate) spanking received by Spacedman!
I have a k-nearest neighbors object (an igraph) which I created as such, by using the file I have uploaded here:
I performed the following operations on the data, in order to create an adjacency matrix of distances between observations:
W <- read.csv("/path/sim_matrix.csv")
W <- W[, -c(1,3)]
W <- scale(W)
sim_matrix <- dist(W, method = "euclidean", upper=TRUE)
sim_matrix <- as.matrix(sim_matrix)
mygraph <- nng(sim_matrix, k=10)
This gives me a nice list of vertices and their ten closest neighbors, a small sample follows:
1 -> 25 26 28 30 32 144 146 151 177 183
2 -> 4 8 32 33 145 146 154 156 186 199
3 -> 1 25 28 51 54 106 144 151 177 234
4 -> 7 8 89 95 97 158 160 170 186 204
5 -> 9 11 17 19 21 112 119 138 145 158
6 -> 10 12 14 18 20 22 147 148 157 194
7 -> 4 13 123 132 135 142 160 170 173 174
8 -> 4 7 89 90 95 97 158 160 186 204
So far so good.
What I'm struggling with, however, is how to get access to the values of the weights between the vertices, so that I can do meaningful calculations on them. Shouldn't be so hard, this is a common thing to want from graphs, no?
Looking at the documentation, I tried:
degree(mygraph)
which gives me the sum of the weights for each node. But I don't want the sum, I want the raw data, so I can do my own calculations.
I tried
get.data.frame(mygraph,"E")[1:10,]
but this has none of the distances between nodes:
from to
1 1 25
2 1 26
3 1 28
4 1 30
5 1 32
6 1 144
7 1 146
8 1 151
9 1 177
10 1 183
I have attempted to get values for the weights between vertices out of the graph object, that I can work with, but no luck.
If anyone has any ideas on how to go about approaching this, I'd be grateful. Thanks.
It's not clear from your question whether you are starting with a dataset, or with a distance matrix, e.g. nng(x=mydata,...) or nng(dx=mydistancematrix,...), so here are solutions with both.
library(cccd)
df <- mtcars[,c("mpg","hp")] # extract from mtcars dataset
# knn using dataset only
g <- nng(x=as.matrix(df),k=5) # for each car, 5 other most similar mpg and hp
V(g)$name <- rownames(df) # meaningful names for the vertices
dm <- as.matrix(dist(df)) # full distance matrix
E(g)$weight <- apply(get.edges(g,1:ecount(g)),1,function(x)dm[x[1],x[2]])
# knn using distance matrix (assumes you have dm already)
h <- nng(dx=dm,k=5)
V(h)$name <- rownames(df)
E(h)$weight <- apply(get.edges(h,1:ecount(h)),1,function(x)dm[x[1],x[2]])
# same result either way
identical(get.data.frame(g),get.data.frame(h))
# [1] TRUE
So these approaches identify the distances from each vertex to its five nearest neighbors, and set the edge weight attribute to those values. Interestingly, plot(g) works fine, but plot(h) fails. I think this might be a bug in the plot method for cccd.
If all you want to know is the distances from each vertex to the nearest neighbors, the code below does not require package cccd.
knn <- t(apply(dm,1,function(x)sort(x)[2:6]))
rownames(knn) <- rownames(df)
Here, the matrix knn has a row for each vertex and columns specifying the distance from that vertex to its 5 nearest neighbors. It does not tell you which neighbors those are, though.
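To also recover which neighbors those are, order() can be used alongside sort(). The sketch below is one way to do it under the same mtcars setup; the names knn_idx, knn_dist and knn_names are introduced here for illustration. The diagonal is set to Inf first, because mtcars contains duplicate (mpg, hp) pairs, and with a zero distance to another car order(x)[2:6] could otherwise return the vertex itself.

```r
df <- mtcars[, c("mpg", "hp")]
dm <- as.matrix(dist(df))
k  <- 5

# Exclude self-distances so a vertex is never its own neighbour
diag(dm) <- Inf

# Row i: indices of the k nearest neighbours of vertex i, and their distances
knn_idx  <- t(apply(dm, 1, function(x) order(x)[1:k]))
knn_dist <- t(apply(dm, 1, function(x) sort(x)[1:k]))

# Same information with vertex names instead of indices
knn_names <- matrix(rownames(dm)[knn_idx], nrow = nrow(knn_idx),
                    dimnames = list(rownames(dm), NULL))

knn_names["Mazda RX4", ]
knn_dist["Mazda RX4", ]
```

knn_dist holds the same values as the knn matrix above; knn_idx and knn_names add the identity of each neighbor in matching column order.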
Okay, I've found an nng function in the cccd package. Is that it? If so, then mygraph is just an igraph object and you can use E(mygraph)$whatever to get the values of the edge attribute named whatever.
Following one of the cccd examples to create G1 here, you can get a data frame of all the edges and attributes thus:
get.data.frame(G1,"E")[1:10,]
You can get/set individual edge attributes with E(g)$whatever:
> E(G1)$weight=1:250
> E(G1)$whatever=runif(250)
> get.data.frame(G1,"E")[1:10,]
from to weight whatever
1 1 3 1 0.11861240
2 1 7 2 0.06935047
3 1 22 3 0.32040316
4 1 29 4 0.86991432
5 1 31 5 0.47728632
Is that what you are after? Any igraph package tutorial will tell you more!
