Find local minimum in a vector with r - r

Taking the ideas from the following links:
the local minimum between the two peaks
How to explain ...
I look for the local minimum or minimums, avoiding the use of functions already created for this purpose [max / min locale or global].
Our progress:
#DATA
simulate <- function(lambda=0.3, mu=c(0, 4), sd=c(1, 1), n.obs=10^5) {
x1 <- rnorm(n.obs, mu[1], sd[1])
x2 <- rnorm(n.obs, mu[2], sd[2])
return(ifelse(runif(n.obs) < lambda, x1, x2))
}
data <- simulate()
hist(data)
d <- density(data)
#
#https://stackoverflow.com/a/25276661/8409550
##Since the x-values are equally spaced, we can estimate dy using diff(d$y)
d$x[which.min(abs(diff(d$y)))]
#With our data we did not obtain the expected value
#
d$x[which(diff(sign(diff(d$y)))>0)+1]#pit
d$x[which(diff(sign(diff(d$y)))<0)+1]#peak
#we check
#1
optimize(approxfun(d$x,d$y),interval=c(0,4))$minimum
optimize(approxfun(d$x,d$y),interval=c(0,4),maximum = TRUE)$maximum
#2
tp <- pastecs::turnpoints(d$y)
summary(tp)
ind <- (1:length(d$y))[extract(tp, no.tp = FALSE, peak = TRUE, pit = TRUE)]
d$x[ind[2]]
d$x[ind[1]]
d$x[ind[3]]
My questions and request for help:
Why did the command lines fail:
d$x[which.min(abs(diff(d$y)))]
It is possible to eliminate the need to add one to the index in the command lines:
d$x[which(diff(sign(diff(d$y)))>0)+1]#pit
d$x[which(diff(sign(diff(d$y)))<0)+1]#peak
How to get the optimize function to return the two expected maximum values?

Question 1
The answer to the first question is straighforward. The line d$x[which.min(abs(diff(d$y)))] asks for the x value at which there was the smallest change in y between two consecutive points. The answer is that this happened at the extreme right of the plot where the density curve is essentially flat:
which.min(abs(diff(d$y)))
#> [1] 511
length(abs(diff(d$y)))
#> [1] 511
This is not only smaller than the difference at your local maxima /minima points; it is orders of magnitude smaller. Let's zoom in to the peak value of d$y, including only the peak and the point on each side:
which.max(d$y)
#> [1] 324
plot(d$x[323:325], d$y[323:325])
We can see that the smallest difference is around 0.00005, or 5^-5, between two consecutive points. Now look at the end of the plot where it is flattest:
plot(d$x[510:512], d$y[510:512])
The difference is about 1^-7, which is why this is the flattest point.
Question 2
The answer to your second question is "no, not really". You are taking a double diff, which is two elements shorter than x, and if x is n elements long, a double diff will correspond to elements 2 to (n - 1) in x. You can remove the +1 from the index, but you will have an off-by-one error if you do that. If you really wanted to, you could concatenate dummy zeros at each stage of the diff, like this:
d$x[which(c(0, diff(sign(diff(c(d$y, 0))))) > 0)]
which gives the same result, but this is longer, harder to read and harder to justify, so why would you?
Question 3
The answer to the third question is that you could use the "pit" as the dividing point between the minimum and maximum value of d$x to find the two "peaks". If you really want a single call to get both at once, you could do it inside an sapply:
pit <- optimize(approxfun(d$x,d$y),interval=c(0,4))$minimum
peaks <- sapply(1:2, function(i) {
optimize(approxfun(d$x, d$y),
interval = c(min(d$x), pit, max(d$x))[i:(i + 1)],
maximum = TRUE)$maximum
})
pit
#> [1] 1.691798
peaks
#> [1] -0.02249845 3.99552521

Related

Use R to Efficiently Order Randomly Generated Transects

Problem
I looking for a way to efficiently order randomly selected sampling transects occur around a fixed object. These transects, once generated, need to be ordered in a way that makes sense spatially such that the distance traveled is minimized. This would be done by ensuring that the end point of the current transect is as close as possible to the start point of the next transect. Also, none of the transects can be repeated.
Because there are thousands of transects to order, and this is a very tedious task to do manually, and I am trying to use R to automate this process. I have already generated the transects, each having a start and end point whose location is indicated using a 360-degree system (e.g., 0 is North, 90 is East, 180 is South, and 270 is West). I have also generated some code that seems to indicate the start point and the ID for the next transect, but there are a few problems with this code: (1) it can generate errors depending on the start and end points being considered, (2) it doesn't achieve what I ultimately need it to achieve, and (3) as is, the code itself seems overly complicated and I can't help but wonder if there is a more straightforward way to do this.
Ideally, the code would result in the transects being reordered such they match the order that they should be flown rather than the order that they were initially input.
The Data
For simplicity, let's pretend there are just 10 transects to order.
# Transect ID for the start point
StID <- c(seq(1, 10, 1))
# Location of transect start point, based on a 360-degree circle
StPt <- c(342.1, 189.3, 116.5, 67.9, 72, 208.4, 173.2, 97.8, 168.7, 138.2)
# Transect ID for the end point
EndID <- c(seq(1, 10, 1))
# Location of transect start point, based on a 360-degree circle
EndPt <- c(122.3, 313.9, 198.7, 160.4, 166, 26.7, 312.7, 273.7, 288.8, 287.5)
# Dataframe
df <- cbind.data.frame(StPt, StID, EndPt, EndID)
What I Have Tried
Please feel free to ignore this code, there has to be a better way and it does not really achieve the intended outcome. Right now I am using a nested for-loop that is very difficult to intuitively follow but represents my best attempt thus far.
# Create two new columns that will be populated using a loop
df$StPt_Next <- NA
df$ID_Next <- NA
# Also create a list to be populated as end and start points are matched
used <- c(df$StPt[1]) #puts the start point of transect #1 into the used vector since we will start with 1 and do not want to have it used again
# Then, for every row in the dataframe...
for (i in seq(1,length(df$EndPt)-1, 1)){ # Selects all rows except the last one as the last transect should have no "next" transect
# generate some print statements to indicate that the script is indeed running while you wait....
print(paste("######## ENDPOINT", i, ":", df$EndPt[i], " ########"))
print(paste("searching for a start point that fits criteria to follow this endpoint",sep=""))
# sequentially select each end point
valueEndPt <- df[i,1]
# and order the index by taking the absolute difference of end and start points and, if this value is greater than 180, also subtract from 360 so all differences are less than 180, then order differences from smallest to largest
orderx <- order(ifelse(360-abs(df$StPt-valueEndPt) > 180,
abs(df$StPt-valueEndPt),
360-abs(df$StPt-valueEndPt)))
tmp <- as.data.frame(orderx)
# specify index value
index=1
# for as long as there is an "NA" present in the StPt_Next created before for loop...
while (is.na(df$StPt_Next[i])) {
#select the value of the ordered index in sequential order
j=orderx[index]
# if the start point associated with a given index is present in the list of used values...
if (df$StPt[j] %in% used){
# then have R print a statement indicate this is the case...
print(paste("passing ",df$StPt[j], " as it has already been used",sep=""))
# and move onto the next index
index=index+1
# break statement intended to skip the remainder of the code for values that have already been used
next
# if the start point associated with a given index is not present in the list of used values...
} else {
# then identify the start point value associated with that index ID...
valueStPt <- df$StPt[j]
# and have R print a statement indicating an attempt is being made to use the next value
print(paste("trying ",df$StPt[j],sep=""))
# if the end transect number is different from the start end transect number...
if (df$EndID[i] != df$StID[j]) {
# then put the start point in the new column...
df$StPt_Next[i] <- df$StPt[j]
# note which record this start point came from for ease of reference/troubleshooting...
df$ID_Next[i] <- j
# have R print a statement that indicates a value for the new column has beed selected...
print(paste("using ",df$StPt[j],sep=""))
# and add that start point to the list of used ones
used <- c(used,df$StPt[j])
# otherwise, if the end transect number matches the start end transect number...
} else {
# keep NA in this column and try again
df$StPt_Next[i] <- NA
# and indicate that this particular matched pair can not be used
print(paste("cant use ",valueStPt," as the column EndID (related to index in EndPt) and StID (related to index in StPt) values are matching",sep=""))
}# end if else statement to ensure that start and end points come from different transects
# and move onto the next index
index=index+1
}# end if else statement to determine if a given start point still needs to be used
}# end while loop to identify if there are still NA's in the new column
}# end for loop
The Output
When the code does not produce an explicit error, such as for the example data provided, the output is as follows:
StPt StID EndPt EndID StPt_Next ID_Next
1 342.1 1 122.3 1 67.9 4
2 189.3 2 313.9 2 173.2 7
3 116.5 3 198.7 3 97.8 8
4 67.9 4 160.4 4 72.0 5
5 72.0 5 166.0 5 116.5 3
6 208.4 6 26.7 6 189.3 2
7 173.2 7 312.7 7 168.7 9
8 97.8 8 273.7 8 138.2 10
9 168.7 9 288.8 9 208.4 6
10 138.2 10 287.5 10 NA NA
The last two columns were generated by the code and added to the original dataframe. StPt_Next has the location of the next closest start point and ID_Next indicates the transectID associated with that next start point location. The ID_Next column indicates that the order transects should be flown is as follows 1,4,5,3,8,10,NA (aka. the end), and 2,7,9,6 form their own loop that goes back to 2.
There are two specific problems that I can't solve:
(1) There is a problem of forming one continuous chain of sequence. I think this is related to having 1 be the starting transect and 10 being the last transect no matter what, but not knowing how to indicate in the code that the second to last transect must match up with 10 so that the sequence includes all 10 transects before terminating at an "NA" representing the final end point.
(2) To really automate this process, after fixing the early termination of the sequence due to the premature introduction of the "NA" as the ID_next, a new column would be made that would allow the transects to be reordered based on the most efficient progression rather than the original order of their EndID/StartID.
Intended Outcome
If we pretend that we only had 6 transects to order and ignore the 4 that were not able to be ordered due to the premature introduction of the "NA", this would be the intended outcome:
StPt StID EndPt EndID StPt_Next ID_Next TransNum
1 342.1 1 122.3 1 67.9 4 1
4 67.9 4 160.4 4 72.0 5 2
5 72.0 5 166.0 5 116.5 3 3
3 116.5 3 198.7 3 97.8 8 4
8 97.8 8 273.7 8 138.2 10 5
10 138.2 10 287.5 10 NA NA 6
EDIT: A Note About the Error Message Explicitly Produced by the Code
As indicated earlier, the code has a few flaws. Another flaw is that it will often produce an error when trying to order a larger number of transects. I am not entirely sure at what step in the process the error is generated, but I am guessing that it is related to the inability to match up the last few transects, possibly due to not meeting the criteria set forth by "orderx". The print statements say "trying NA" instead of a start point in the database, which results in this error: "Error in if (df$EndID[i] != df$StID[j]) { : missing value where TRUE/FALSE needed". I am guessing that there would need to be another if-else statement that somehow indicates "if the remaining points do not meet the orderx criteria, then just force them to match up with whatever transect remains so that everything is assigned a StPt_Next and ID_Next".
Here is a larger dataset that will generate the error:
EndPt <- c(158.7,245.1,187.1,298.2,346.8,317.2,74.5,274.2,153.4,246.7,193.6,302.3,6.8,359.1,235.4,134.5,111.2,240.5,359.2,121.3,224.5,212.6,155.1,353.1,181.7,334,249.3,43.9,38.5,75.7,344.3,45.1,285.7,155.5,183.8,60.6,301,132.1,75.9,112,342.1,302.1,288.1,47.4,331.3,3.4,185.3,62,323.7,188,313.1,171.6,187.6,291.4,19.2,210.3,93.3,24.8,83.1,193.8,112.7,204.3,223.3,210.7,201.2,41.3,79.7,175.4,260.7,279.5,82.4,200.2,254.2,228.9,1.4,299.9,102.7,123.7,172.9,23.2,207.3,320.1,344.6,39.9,223.8,106.6,156.6,45.7,236.3,98.1,337.2,296.1,194,307.1,86.6,65.5,86.6,296.4,94.7,279.9)
StPt <- c(56.3,158.1,82.4,185.5,243.9,195.6,335,167,39.4,151.7,99.8,177.2,246.8,266.1,118.2,358.6,357.9,99.6,209.9,342.8,106.5,86.4,35.7,200.6,65.6,212.5,159.1,297,285.9,300.9,177,245.2,153.1,8.1,76.5,322.4,190.8,35.2,342.6,8.8,244.6,202,176.2,308.3,184.2,267.2,26.6,293.8,167.3,30.5,176,74.3,96.9,186.7,288.2,62.6,331.4,254.7,324.1,73.4,16.4,64,110.9,74.4,69.8,298.8,336.6,58.8,170.1,173.2,330.8,92.6,129.2,124.7,262.3,140.4,321.2,34,79.5,263,66.4,172.8,205.5,288,98.5,335.2,38.7,289.7,112.7,350.7,243.2,185.4,63.9,170.3,326.3,322.9,320.6,199.2,287.1,158.1)
EndID <- c(seq(1, 100, 1))
StID <- c(seq(1, 100, 1))
df <- cbind.data.frame(StPt, StID, EndPt, EndID)
Any advice would be greatly appreciated!
As #chinsoon12 points out hidden in your problem you have an (Asymmetric) Traveling Salesman Problem. The asymmetry arises because the start and end points of your transecs are different.
ATSP is a renowned NP-complete problem. So exact solutions are very difficult even for medium sized problems (see wikipedia for more info). Hence the best we can do in most cases is approximations or heuristics. As you mention there are thousands of transects this is at least a medium sized problem.
Rather than code an ATSP approximation algorithm from the start, there is an existing TSP library for R. This includes several approximation algorithms. Reference documentation is here.
The follow is my use of the TSP package applied to your problem. Beginning with setup (assume I have run StPt, StID, EndPt, and EndID as in your question.
install.packages("TSP")
library(TSP)
library(dplyr)
# Dataframe
df <- cbind.data.frame(StPt, StID, EndPt, EndID)
# filter to 6 example nodes for requested comparison
df = df %>% filter(StID %in% c(1,3,4,5,8,10))
We shall use ATSP from a distance matrix. Position [row,col] in the matrix is the cost/distance of going from (the end of) transect row to (the start of) transect col. This code creates the entire distance matrix.
# distance calculation
transec_distance = function(end,start){
abs_dist = abs(start-end)
ifelse(360-abs_dist > 180, abs_dist, 360-abs_dist)
}
# distance matrix
matrix_distance = matrix(data = NA, nrow = nrow(df), ncol = nrow(df))
for(start_id in 1:nrow(df)){
start_point = df[start_id,'StPt']
for(end_id in 1:nrow(df)){
end_point = df[end_id,'EndPt']
matrix_distance[end_id,start_id] = transec_distance(end_point, start_point)
}
}
Note that there are more effective ways to construct a distance matrix. However, I have chosen this approach for its clarity. Depending on your computer and the exact number of transects this code may run very slowly.
Also, note that the size of this matrix is quadratic to the number of transects. So for a large number of transects, you will discover there is not enough memory.
The solving is very unexciting. The distance matrix gets turned into a ATSP object, and the ATSP object gets passed to the solver. We then proceed to add the ordering/traveling information to the original df.
answer = solve_TSP(as.ATSP(matrix_distance))
# get length of cycle
print(answer)
# sort df to same order as solution
df_w_answer = df[as.numeric(answer),]
# add info about next transect to each transect
df_w_answer = df_w_answer %>%
mutate(visit_order = 1:nrow(df_w_answer)) %>%
mutate(next_StID = lead(StID, order_by = visit_order),
next_StPt = lead(StPt, order_by = visit_order))
# add info about next transect to each transect (for final transect)
df_w_answer[df_w_answer$visit_order == nrow(df_w_answer),'next_StID'] =
df_w_answer[df_w_answer$visit_order == 1,'StID']
df_w_answer[df_w_answer$visit_order == nrow(df_w_answer),'next_StPt'] =
df_w_answer[df_w_answer$visit_order == 1,'StPt']
# compute distance between end of each transect and start of next
df_w_answer = df_w_answer %>% mutate(dist_between = transec_distance(EndPt, next_StPt))
At this point we have a cycle. You can pick any node as the starting point, follow the order given in the df: from EndID to next_StID, and you will cover every transect in (a good approximation to) the minimum distance.
However in your 'intended outcome' you have a path solution (e.g. start at transect 1 and finish at transect 10). We can turn the cycle into a path by excluding the single most expensive transition:
# as path (without returning to start)
min_distance = sum(df_w_answer$dist_between) - max(df_w_answer$dist_between)
path_start = df_w_answer[df_w_answer$dist_between == max(df_w_answer$dist_between), 'next_StID']
path_end = df_w_answer[df_w_answer$dist_between == max(df_w_answer$dist_between), 'EndID']
print(sprintf("minimum cost path = %.2f, starting at node %d, ending at node %d",
min_distance, path_start, path_end))
Running all the above gives me a different, but superior, answer to your intended outcome. I get the following order: 1 --> 5 --> 8 --> 4 --> 3 --> 10 --> 1.
You path from transect 1 to transect 10 has a total distance of 428, if we also returned from transect 10 to transect 1, making this a cycle, the total distance would be 483.
Using the TSP package in R we get a path from 1 to 10 with total distance 377, and as a cycle 431.
If we instead start at node 4 and end at node 8, we get a total distance of 277.
Some additional nodes:
Not all TSP solvers are deterministic, hence you may get some variation in your answer if you run again, or run with the input rows in a different order.
TSP is a much more general problem that the transect problem you described. It is possible that your problem has enough additional/special features that means it can be solved perfectly in a reasonable length of time. But this moves your problem into the realm of mathematics.
If you are running out of memory to create the distance matrix, take a look at the documentation for the TSP package. It contains several examples that use geo-coordinates as inputs rather than a distance matrix. This is a much smaller input size (presumably the package calculates the distances on the fly) so if you convert the start and end points to coordinates and specify euclidean (or some other common distance function) you could get around (some) computer memory limits.
Another version of using the TSP package...
Here is the setup.
library(TSP)
planeDim = 15
nTransects = 26
# generate some random transect beginning points in a plane, the size of which
# is defined by planeDim
b = cbind(runif(nTransects)*planeDim, runif(nTransects)*planeDim)
# generate some random transect ending points that are a distance of 1 from each
# beginning point
e = t(
apply(
b,
1,
function(x) {
bearing = runif(1)*2*pi
x + c(cos(bearing), sin(bearing))
}
)
)
For fun, we can visualize the transects:
# make an empty graph space
plot(1,1, xlim = c(-1, planeDim + 1), ylim = c(-1, planeDim + 1), ty = "n")
# plot the beginning of each transect as a green point, the end as a red point,
# with a thick grey line representing the transect
for(i in 1:nrow(e)) {
xs = c(b[i,1], e[i,1])
ys = c(b[i,2], e[i,2])
lines(xs, ys, col = "light grey", lwd = 4)
points(xs, ys, col = c("green", "red"), pch = 20, cex = 1.5)
text(mean(xs), mean(ys), letters[i])
}
So given a matrix of x,y pairs ("b") for beginning points and a matrix of x,y
pairs ("e") for end points of each transect, the solution is...
# a function that calculates the distance from all endpoints in the ePts matrix
# to the single beginning point in bPt
dist = function(ePts, bPt) {
# apply pythagorean theorem to calculate the distance between every end point
# in the matrix ePts to the point bPt
apply(ePts, 1, function(p) sum((p - bPt)^2)^0.5)
}
# apply the "dist" function to all begining points to create the distance
# matrix. since the distance from the end of transect "foo" to the beginning of
# "bar" is not the same as from the end of "bar" to the beginning of "foo," we
# have an asymmetric travelling sales person problem. Therefore, distance
# matrix is directional. The distances at any position in the matrix must be
# the distance from the transect shown in the row label and to the transect
# shown in the column label.
distMatrix = apply(b, 1, FUN = dist, ePts = e)
# for covenience, we'll labels the trasects a to z
dimnames(distMatrix) = list(letters, letters)
# set the distance between the beginning and end of each transect to zero so
# that there is no "cost" to walking the transect
diag(distMatrix) = 0
Here is the upper left corner of the distance matrix:
> distMatrix[1:6, 1:6]
a b c d e f
a 0.00000 15.4287270 12.637979 12.269356 15.666710 12.3919715
b 13.58821 0.0000000 5.356411 13.840444 1.238677 12.6512352
c 12.48161 6.3086852 0.000000 8.427033 6.382304 7.1387840
d 10.69748 13.5936114 7.708183 0.000000 13.718517 0.9836146
e 14.00920 0.7736654 5.980220 14.470826 0.000000 13.2809601
f 12.24503 12.8987043 6.984763 2.182829 12.993283 0.0000000
Now three lines of code from the TSP package solves the problem.
atsp = as.ATSP(distMatrix)
tour = solve_TSP(atsp)
# assume we want to start our circuit at transect "a".
path = cut_tour(tour, "a", exclude_cut = F)
The variable path shows the order in which you should visit the transects:
> path
a w x q i o l d f s h y g v t k c m e b p u z j r n
1 23 24 17 9 15 12 4 6 19 8 25 7 22 20 11 3 13 5 2 16 21 26 10 18 14
We can add the path to the visualization:
for(i in 1:(length(path)-1)) {
lines(c(e[path[i],1], b[path[i+1],1]), c(e[path[i],2], b[path[i+1], 2]), lty = "dotted")
}
Thanks everyone for the suggestions, #Simon's solution was most tailored to my OP. #Geoffrey's actual approach of using x,y coordinates was great as it allows for the plotting of the transects and the travel order. Thus, I am posting a hybrid solution that was generated using code by both of them and well as additional comments and code to detail the process and get to the actual end result I was aiming for. I am not sure if this will help anyone in the future but, since there was no answer that provided a solution that solved my problem 100% of the way, I thought I'd share what I came up with.
As others have noted, this is a type of traveling salesperson problem. It is asymmetric because the distance from the end of transect "t" to the beginning of transect "t+1" is not the same as the distance from the end transect "t+1" to the start of transect "t". Also, it is a "path" solution rather than a "cycle" solution.
#=========================================
# Packages
#=========================================
library(TSP)
library(useful)
library(dplyr)
#=========================================
# Full dataset for testing
#=========================================
EndPt <- c(158.7,245.1,187.1,298.2,346.8,317.2,74.5,274.2,153.4,246.7,193.6,302.3,6.8,359.1,235.4,134.5,111.2,240.5,359.2,121.3,224.5,212.6,155.1,353.1,181.7,334,249.3,43.9,38.5,75.7,344.3,45.1,285.7,155.5,183.8,60.6,301,132.1,75.9,112,342.1,302.1,288.1,47.4,331.3,3.4,185.3,62,323.7,188,313.1,171.6,187.6,291.4,19.2,210.3,93.3,24.8,83.1,193.8,112.7,204.3,223.3,210.7,201.2,41.3,79.7,175.4,260.7,279.5,82.4,200.2,254.2,228.9,1.4,299.9,102.7,123.7,172.9,23.2,207.3,320.1,344.6,39.9,223.8,106.6,156.6,45.7,236.3,98.1,337.2,296.1,194,307.1,86.6,65.5,86.6,296.4,94.7,279.9)
StPt <- c(56.3,158.1,82.4,185.5,243.9,195.6,335,167,39.4,151.7,99.8,177.2,246.8,266.1,118.2,358.6,357.9,99.6,209.9,342.8,106.5,86.4,35.7,200.6,65.6,212.5,159.1,297,285.9,300.9,177,245.2,153.1,8.1,76.5,322.4,190.8,35.2,342.6,8.8,244.6,202,176.2,308.3,184.2,267.2,26.6,293.8,167.3,30.5,176,74.3,96.9,186.7,288.2,62.6,331.4,254.7,324.1,73.4,16.4,64,110.9,74.4,69.8,298.8,336.6,58.8,170.1,173.2,330.8,92.6,129.2,124.7,262.3,140.4,321.2,34,79.5,263,66.4,172.8,205.5,288,98.5,335.2,38.7,289.7,112.7,350.7,243.2,185.4,63.9,170.3,326.3,322.9,320.6,199.2,287.1,158.1)
EndID <- c(seq(1, 100, 1))
StID <- c(seq(1, 100, 1))
df <- cbind.data.frame(StPt, StID, EndPt, EndID)
#=========================================
# Convert polar coordinates to cartesian x,y data
#=========================================
# Area that the transect occupy in space only used for graphing
planeDim <- 1
# Number of transects
nTransects <- 100
# Convert 360-degree polar coordinates to x,y cartesian coordinates to facilitate calculating a distance matrix based on the Pythagorean theorem
EndX <- as.matrix(pol2cart(planeDim, EndPt, degrees = TRUE)["x"])
EndY <- as.matrix(pol2cart(planeDim, EndPt, degrees = TRUE)["y"])
StX <- as.matrix(pol2cart(planeDim, StPt, degrees = TRUE)["x"])
StY <- as.matrix(pol2cart(planeDim, StPt, degrees = TRUE)["y"])
# Matrix of x,y pairs for the beginning ("b") and end ("e") points of each transect
b <- cbind(c(StX), c(StY))
e <- cbind(c(EndX), c(EndY))
#=========================================
# Function to calculate the distance from all endpoints in the ePts matrix to a single beginning point in bPt
#=========================================
dist <- function(ePts, bPt) {
# Use the Pythagorean theorem to calculate the hypotenuse (i.e., distance) between every end point in the matrix ePts to the point bPt
apply(ePts, 1, function(p) sum((p - bPt)^2)^0.5)
}
#=========================================
# Distance matrix
#=========================================
# Apply the "dist" function to all beginning points to create a matrix that has the distance between every start and endpoint
## Note: because this is an asymmetric traveling salesperson problem, the distance matrix is directional, thus, the distances at any position in the matrix must be the distance from the transect shown in the row label and to the transect shown in the column label
distMatrix <- apply(b, 1, FUN = dist, ePts = e)
## Set the distance between the beginning and end of each transect to zero so that there is no "cost" to walking the transect
diag(distMatrix) <- 0
#=========================================
# Solve asymmetric TSP
#=========================================
# This creates an instance of the asymmetric traveling salesperson (ASTP)
atsp <- as.ATSP(distMatrix)
# This creates an object of Class Tour that travels to all of the points
## In this case, the repetitive_nn produces the smallest overall and transect-to-transect
tour <- solve_TSP(atsp, method = "repetitive_nn")
#=========================================
# Create a path by cutting the tour at the most "expensive" transition
#=========================================
# Sort the original data frame to match the order of the solution
dfTour = df[as.numeric(tour),]
# Add the following columns to the original dataframe:
dfTour = dfTour %>%
# Assign visit order (1 to 100, ascending)
mutate(visit_order = 1:nrow(dfTour)) %>%
# The ID of the next transect to move to
mutate(next_StID = lead(StID, order_by = visit_order),
# The angle of the start point for the next transect
next_StPt = lead(StPt, order_by = visit_order))
# lead() generates the NA's in the last record for next_StID, next_StPt, replace these by adding that information
dfTour[dfTour$visit_order == nrow(dfTour),'next_StID'] <-
dfTour[dfTour$visit_order == 1,'StID']
dfTour[dfTour$visit_order == nrow(dfTour),'next_StPt'] <-
dfTour[dfTour$visit_order == 1,'StPt']
# Function to calculate distance for 360 degrees rather than x,y coordinates
transect_distance <- function(end,start){
abs_dist = abs(start-end)
ifelse(360-abs_dist > 180, abs_dist, 360-abs_dist)
}
# Compute distance between end of each transect and start of next using polar coordinates
dfTour = dfTour %>% mutate(dist_between = transect_distance(EndPt, next_StPt))
# Identify the longest transition point for breaking the cycle
min_distance <- sum(dfTour$dist_between) - max(dfTour$dist_between)
path_start <- dfTour[dfTour$dist_between == max(dfTour$dist_between), 'next_StID']
path_end <- dfTour[dfTour$dist_between == max(dfTour$dist_between), 'EndID']
# Make a statement about the least cost path
print(sprintf("minimum cost path = %.2f, starting at node %d, ending at node %d",
min_distance, path_start, path_end))
# The variable path shows the order in which you should visit the transects
path <- cut_tour(tour, path_start, exclude_cut = F)
# Arrange df from smallest to largest travel distance
tmp1 <- dfTour %>%
arrange(dist_between)
# Change dist_between and visit_order to NA for transect with the largest distance to break cycle
# (e.g., we will not travel this distance, this represents the path endpoint)
tmp1[length(dfTour$dist_between):length(dfTour$dist_between),8] <- NA
tmp1[length(dfTour$dist_between):length(dfTour$dist_between),5] <- NA
# Set df order back to ascending by visit order
tmp2 <- tmp1 %>%
arrange(visit_order)
# Detect the break in a sequence of visit_order introduced by the NA (e.g., 1,2,3....5,6) and mark groups before the break with 0 and after the break with 1 in the "cont_per" column
tmp2$cont_per <- cumsum(!c(TRUE, diff(tmp2$visit_order)==1))
# Sort "cont_per" such that the records following the break become the beginning of the path and the ones following the break represent the middle orders and the point with the NA being assigned the last visit order, and assign a new visit order
tmp3 <- tmp2%>%
arrange(desc(cont_per))%>%
mutate(visit_order_FINAL=seq(1, length(tmp2$visit_order), 1))
# Datframe ordered by progression of transects
trans_order <- cbind.data.frame(tmp3[2], tmp3[1], tmp3[4], tmp3[3], tmp3[6], tmp3[7], tmp3[8], tmp3[10])
# Insert NAs for "next" info for final transect
trans_order[nrow(trans_order),'next_StPt'] <- NA
trans_order[nrow(trans_order), 'next_StID'] <- NA
#=========================================
# View data
#=========================================
head(trans_order)
#=========================================
# Plot
#=========================================
#For fun, we can visualize the transects:
# make an empty graph space
plot(1,1, xlim = c(-planeDim-0.1, planeDim+0.1), ylim = c(-planeDim-0.1, planeDim+0.1), ty = "n")
# plot the beginning of each transect as a green point, the end as a red point,
and a grey line representing the transect
for(i in 1:nrow(e)) {
xs = c(b[i,1], e[i,1])
ys = c(b[i,2], e[i,2])
lines(xs, ys, col = "light grey", lwd = 1, lty = 1)
points(xs, ys, col = c("green", "red"), pch = 1, cex = 1)
#text((xs), (ys), i)
}
# Add the path to the visualization
for(i in 1:(length(path)-1)) {
# This makes a line between the x coordinates for the end point of path i and beginning point of path i+1
lines(c(e[path[i],1], b[path[i+1],1]), c(e[path[i],2], b[path[i+1], 2]), lty = 1, lwd=1)
}
This is what the end result looks like

Find correct 2D translation of a subset of coordinates

I have a problem I wish to solve in R with example data below. I know this must have been solved many times but I have not been able to find a solution that works for me in R.
The core of what I want to do is to find how to translate a set of 2D coordinates to best fit into an other, larger, set of 2D coordinates. Imagine for example having a Polaroid photo of a small piece of the starry sky with you out at night, and you want to hold it up in a position so they match the stars' current positions.
Here is how to generate data similar to my real problem:
# create reference points (the "starry sky")
set.seed(99)
ref_coords = data.frame(x = runif(50,0,100), y = runif(50,0,100))
# generate points take subset of coordinates to serve as points we
# are looking for ("the Polaroid")
my_coords_final = ref_coords[c(5,12,15,24,31,34,48,49),]
# add a little bit of variation as compared to reference points
# (data should very similar, but have a little bit of noise)
set.seed(100)
my_coords_final$x = my_coords_final$x+rnorm(8,0,.1)
set.seed(101)
my_coords_final$y = my_coords_final$y+rnorm(8,0,.1)
# create "start values" by, e.g., translating the points we are
# looking for to start at (0,0)
my_coords_start =apply(my_coords_final,2,function(x) x-min(x))
# Plot of example data, goal is to find the dotted vector that
# corresponds to the translation needed
plot(ref_coords, cex = 1.2) # "Starry sky"
points(my_coords_start,pch=20, col = "red") # start position of "Polaroid"
points(my_coords_final,pch=20, col = "blue") # corrected position of "Polaroid"
segments(my_coords_start[1,1],my_coords_start[1,2],
my_coords_final[1,1],my_coords_final[1,2],lty="dotted")
Plotting the data as above should yield:
The result I want is basically what the dotted line in the plot above represents, i.e. a delta in x and y that I could apply to the start coordinates to move them to their correct position in the reference grid.
Details about the real data
There should be close to no rotational or scaling difference between my points and the reference points.
My real data is around 1000 reference points and up to a few hundred points to search (could use less if more efficient)
I expect to have to search about 10 to 20 sets of reference points to find my match, as many of the reference sets will not contain my points.
Thank you for your time, I'd really appreciate any input!
EDIT: To clarify, the right plot represent the reference data. The left plot represents the points that I want to translate across the reference data in order to find a position where they best match the reference. That position, in this case, is represented by the blue dots in the previous figure.
Finally, any working strategy must not use the data in my_coords_final, but rather reproduce that set of coordinates starting from my_coords_start using ref_coords.
So, the previous approach I posted (see edit history) using optim() to minimize the sum of distances between points will only work in the limited circumstance where the point distribution used as reference data is in the middle of the point field. The solution that satisfies the question and seems to still be workable for a few thousand points, would be a brute-force delta and comparison algorithm that calculates the differences between each point in the field against a single point of the reference data and then determines how many of the rest of the reference data are within a minimum threshold (which is needed to account for the noise in the data):
## A brute-force approach where min_dist can be used to
## ameliorate some random noise:
min_dist <- 5
win_thresh <- 0
win_thresh_old <- 0
for(i in 1:nrow(ref_coords)) {
x2 <- my_coords_start[,1]
y2 <- my_coords_start[,2]
x1 <- ref_coords[,1] + (x2[1] - ref_coords[i,1])
y1 <- ref_coords[,2] + (y2[1] - ref_coords[i,2])
## Calculate all pairwise distances between reference and field data:
dists <- dist( cbind( c(x1, x2), c(y1, y2) ), "euclidean")
## Only take distances for the sampled data:
dists <- as.matrix(dists)[-1*1:length(x1),]
## Calculate the number of distances within the minimum
## distance threshold minus the diagonal portion:
win_thresh <- sum(rowSums(dists < min_dist) > 1)
## If we have more "matches" than our best then calculate a new
## dx and dy:
if (win_thresh > win_thresh_old) {
win_thresh_old <- win_thresh
dx <- (x2[1] - ref_coords[i,1])
dy <- (y2[1] - ref_coords[i,2])
}
}
## Plot estimated correction (your delta x and delta y) calculated
## from the brute force calculation of shifts:
points(
x=ref_coords[,1] + dx,
y=ref_coords[,2] + dy,
cex=1.5, col = "red"
)
I'm very interested to know if there's anyone that solves this in a more efficient manner for the number of points in the test data, possibly using a statistical or optimization algorithm.

Finding the best matching pairwise points from 2 vectors

I have 2 lists with X,Y coordinates of points.
List 1 contains more points than list 2.
The task is to find pairs of points in a way that the overall euclidean distance is minimized.
I have a working code, but i don't know if this is the best way and I would like to get hint what I can improve for result (better algorithm to find the minimum ) or speed, because the list are about 2000 elements each.
The round in the sample vectors is implemented to get also points with same distances.
With the "rdist" function all distances are generated in "distances". Than the minimum in the matrix is used to link 2 point ("dist_min"). All distances of these 2 points are now replaced by NA and the loop continues by searching the next minimum until all points of list 2 have a point from list 1.
At the end I have added a plot for visualization.
require(fields)
set.seed(1)
x1y1.data <- matrix(round(runif(200*2),2), ncol = 2) # generate 1st set of points
x2y2.data <- matrix(round(runif(100*2),2), ncol = 2) # generate 2nd set of points
distances <- rdist(x1y1.data, x2y2.data)
dist_min <- matrix(data=NA,nrow=ncol(distances),ncol=7) # prepare resulting vector with 7 columns
for(i in 1:ncol(distances))
{
inds <- which(distances == min(distances,na.rm = TRUE), arr.ind=TRUE)
dist_min[i,1] <- inds[1,1] # row of point(use 1st element of inds if points have same distance)
dist_min[i,2] <- inds[1,2] # column of point (use 1st element of inds if points have same distance)
dist_min[i,3] <- distances[inds[1,1],inds[1,2]] # distance of point
dist_min[i,4] <- x1y1.data[inds[1,1],1] # X1 ccordinate of 1st point
dist_min[i,5] <- x1y1.data[inds[1,1],2] # Y1 coordinate of 1st point
dist_min[i,6] <- x2y2.data[inds[1,2],1] # X2 coordinate of 2nd point
dist_min[i,7] <- x2y2.data[inds[1,2],2] # Y2 coordinate of 2nd point
distances[inds[1,1],] <- NA # remove row (fill with NA), where minimum was found
distances[,inds[1,2]] <- NA # remove column (fill with NA), where minimum was found
}
# plot 1st set of points
# print mean distance as measure for optimization
plot(x1y1.data,col="blue",main="mean of min_distances",sub=mean(dist_min[,3],na.rm=TRUE))
points(x2y2.data,col="red") # plot 2nd set of points
segments(dist_min[,4],dist_min[,5],dist_min[,6],dist_min[,7]) # connect pairwise according found minimal distance
This is a fundamental problem in combinatorial optimization known as the assignment problem. One approach to solving the assignment problem is the Hungarian algorithm which is implemented in the R package clue:
require(clue)
sol <- solve_LSAP(t(distances))
We can verify that it outperforms the naive solution:
mean(dist_min[,3])
# [1] 0.05696033
mean(sqrt(
(x2y2.data[,1] - x1y1.data[sol, 1])^2 +
(x2y2.data[,2] - x1y1.data[sol, 2])^2))
#[1] 0.05194625
And we can construct a similar plot to the one in your question:
plot(x1y1.data,col="blue")
points(x2y2.data,col="red")
segments(x2y2.data[,1], x2y2.data[,2], x1y1.data[sol, 1], x1y1.data[sol, 2])

R calculate possible values of two variables

I'm trying to calculate all the possible values of a grid size (x by y) that lead to the same number of cells, so for example a 2x2 grid has a cell size of 4. I want the y to be half of the x, and the total to be, for example 4000. So I guess I want R to calculate all the possible positive integer values of x and y where
function (total) {
x*y=total
x/y=2
x!=total
y!= total.
}
I suppose one way to get positive integers and to consider different solutions would be to allow the total to be up to 10% larger than its original value (but not smaller, I need the grid to be at least as big as the total value I give), in which case the function could have two fields, tot (e.g. 4000) and tolerance (e.g. 10%). Total (as used in the sketch function above) than has to be between tot and (tot+tolerance*tot)
I have several cell sizes so 4000 is only one example. I'm trying to build a quick function which returns positive integers only and returns a matrix of Xs and Ys.
Any ideas?
Many thanks
What about this:
possible.sizes <- function(total, tolerance) {
min.total <- total
max.total <- total * (1 + tolerance)
min.y <- ceiling(sqrt(min.total/2))
max.y <- floor(sqrt(max.total/2))
if (max.y < min.y)
return(data.frame(x=numeric(0), y=numeric(0)))
y <- seq(min.y, max.y)
x <- 2*y
return(data.frame(x=x, y=y))
}
possible.sizes(4000, 0.1)
# x y
# 1 90 45
# 2 92 46

Detecting dips in a 2D plot

I need to automatically detect dips in a 2D plot, like the regions marked with red circles in the figure below. I'm only interested in the "main" dips, meaning the dips have to span a minimum length in the x axis. The number of dips is unknown, i.e., different plots will contain different numbers of dips. Any ideas?
Update:
As requested, here's the sample data, together with an attempt to smooth it using median filtering, as suggested by vines.
Looks like I need now a robust way to approximate the derivative at each point that would ignore the little blips that remain in the data. Is there any standard approach?
y <- c(0.9943,0.9917,0.9879,0.9831,0.9553,0.9316,0.9208,0.9119,0.8857,0.7951,0.7605,0.8074,0.7342,0.6374,0.6035,0.5331,0.4781,0.4825,0.4825,0.4879,0.5374,0.4600,0.3668,0.3456,0.4282,0.3578,0.3630,0.3399,0.3578,0.4116,0.3762,0.3668,0.4420,0.4749,0.4556,0.4458,0.5084,0.5043,0.5043,0.5331,0.4781,0.5623,0.6604,0.5900,0.5084,0.5802,0.5802,0.6174,0.6124,0.6374,0.6827,0.6906,0.7034,0.7418,0.7817,0.8311,0.8001,0.7912,0.7912,0.7540,0.7951,0.7817,0.7644,0.7912,0.8311,0.8311,0.7912,0.7688,0.7418,0.7232,0.7147,0.6906,0.6715,0.6681,0.6374,0.6516,0.6650,0.6604,0.6124,0.6334,0.6374,0.5514,0.5514,0.5412,0.5514,0.5374,0.5473,0.4825,0.5084,0.5126,0.5229,0.5126,0.5043,0.4379,0.4781,0.4600,0.4781,0.3806,0.4078,0.3096,0.3263,0.3399,0.3184,0.2820,0.2167,0.2122,0.2080,0.2558,0.2255,0.1921,0.1766,0.1732,0.1205,0.1732,0.0723,0.0701,0.0405,0.0643,0.0771,0.1018,0.0587,0.0884,0.0884,0.1240,0.1088,0.0554,0.0607,0.0441,0.0387,0.0490,0.0478,0.0231,0.0414,0.0297,0.0701,0.0502,0.0567,0.0405,0.0363,0.0464,0.0701,0.0832,0.0991,0.1322,0.1998,0.3146,0.3146,0.3184,0.3578,0.3311,0.3184,0.4203,0.3578,0.3578,0.3578,0.4282,0.5084,0.5802,0.5667,0.5473,0.5514,0.5331,0.4749,0.4037,0.4116,0.4203,0.3184,0.4037,0.4037,0.4282,0.4513,0.4749,0.4116,0.4825,0.4918,0.4879,0.4918,0.4825,0.4245,0.4333,0.4651,0.4879,0.5412,0.5802,0.5126,0.4458,0.5374,0.4600,0.4600,0.4600,0.4600,0.3992,0.4879,0.4282,0.4333,0.3668,0.3005,0.3096,0.3847,0.3939,0.3630,0.3359,0.2292,0.2292,0.2748,0.3399,0.2963,0.2963,0.2385,0.2531,0.1805,0.2531,0.2786,0.3456,0.3399,0.3491,0.4037,0.3885,0.3806,0.2748,0.2700,0.2657,0.2963,0.2865,0.2167,0.2080,0.1844,0.2041,0.1602,0.1416,0.2041,0.1958,0.1018,0.0744,0.0677,0.0909,0.0789,0.0723,0.0660,0.1322,0.1532,0.1060,0.1018,0.1060,0.1150,0.0789,0.1266,0.0965,0.1732,0.1766,0.1766,0.1805,0.2820,0.3096,0.2602,0.2080,0.2333,0.2385,0.2385,0.2432,0.1602,0.2122,0.2385,0.2333,0.2558,0.2432,0.2292,0.2209,0.2483,0.2531,0.2432,0.2432,0.2432,0.2432,0.3053,0.3630,0.3578,0.3630,0.3668,0.3263,0.3992,0.4037,0.4556,0.4703,0.5173,0.6219,0.6412,0.7275,0.6984,0.6756,0.7079,0.7192,0.7342,0.7458,0.7501,0.7540,0.7605,0.7605,0.7342,0.7912,0.7951,0.8036,0.8074,0.8074,0.8118,0.7951,0.8118,0.8242,0.8488,0.8650,0.8488,0.8311,0.8424,0.7912,0.7951,0.8001,0.8001,0.7458,0.7192,0.6984,0.6412,0.6516,0.5900,0.5802,0.5802,0.5762,0.5623,0.5374,0.4556,0.4556,0.4333,0.3762,0.3456,0.4037,0.3311,0.3263,0.3311,0.3717,0.3762,0.3717,0.3668,0.3491,0.4203,0.4037,0.4149,0.4037,0.3992,0.4078,0.4651,0.4967,0.5229,0.5802,0.5802,0.5846,0.6293,0.6412,0.6374,0.6604,0.7317,0.7034,0.7573,0.7573,0.7573,0.7772,0.7605,0.8036,0.7951,0.7817,0.7869,0.7724,0.7869,0.7869,0.7951,0.7644,0.7912,0.7275,0.7342,0.7275,0.6984,0.7342,0.7605,0.7418,0.7418,0.7275,0.7573,0.7724,0.8118,0.8521,0.8823,0.8984,0.9119,0.9316,0.9512)
yy <- runmed(y, 41)
plot(y, type="l", ylim=c(0,1), ylab="", xlab="", lwd=0.5)
points(yy, col="blue", type="l", lwd=2)
EDITED : function strips the regions to contain nothing but the lowest part, if wanted.
Actually, Using the mean is easier than using the median. This allows you to find regions where the real values are continuously below the mean. The median is not smooth enough for an easy application.
One example function to do this would be :
FindLowRegion <- function(x,n=length(x)/4,tol=length(x)/20,p=0.5){
nx <- length(x)
n <- 2*(n %/% 2) + 1
# smooth out based on means
sx <- rowMeans(embed(c(rep(NA,n/2),x,rep(NA,n/2)),n),na.rm=T)
# find which series are far from the mean
rlesx <- rle((sx-x)>0)
# construct start and end of regions
int <- embed(cumsum(c(1,rlesx$lengths)),2)
# which regions fulfill requirements
id <- rlesx$value & rlesx$length > tol
# Cut regions to be in general smaller than median
regions <-
apply(int[id,],1,function(i){
i <- min(i):max(i)
tmp <- x[i]
id <- which(tmp < quantile(tmp,p))
id <- min(id):max(id)
i[id]
})
# return
unlist(regions)
}
where
n determines how much values are used to calculate the running mean,
tol determines how many consecutive values should be lower than the running mean to talk about a low region, and
p determines the cutoff used (as a quantile) for stripping the regions to their lowest part. When p=1, the complete lower region is shown.
Function is tweaked to work on data as you presented, but the numbers might need to be adjusted a bit to work with other data.
This function returns a set of indices, which allows you to find the low regions. Illustrated with your y vector :
Lows <- FindLowRegion(y)
newx <- seq_along(y)
newy <- ifelse(newx %in% Lows,y,NA)
plot(y, col="blue", type="l", lwd=2)
lines(newx,newy,col="red",lwd="3")
Gives :
You have to smooth the graph in some way. Median filtration is quite useful for that purpose (see http://en.wikipedia.org/wiki/Median_filter). After smoothing, you will simply have to search for the minima, just as usual (i.e. search for the points where the 1st derivative switches from negative to positive).
A simpler answer (which also does not require smoothing) could be provided by adapting the maxdrawdown() function from the tseries. A drawdown is commonly defined as the retreat from the most-recent maximum; here we want the opposite. Such a function could then be used in a sliding window over the data, or over segmented data.
maxdrawdown <- function(x) {
if(NCOL(x) > 1)
stop("x is not a vector or univariate time series")
if(any(is.na(x)))
stop("NAs in x")
cmaxx <- cummax(x)-x
mdd <- max(cmaxx)
to <- which(mdd == cmaxx)
from <- double(NROW(to))
for (i in 1:NROW(to))
from[i] <- max(which(cmaxx[1:to[i]] == 0))
return(list(maxdrawdown = mdd, from = from, to = to))
}
So instead of using cummax(), one would have to switch to cummin() etc.
My first thought was something much cruder than filtering. Why not look for the big drops followed by long enough stable periods?
span.b <- 20
threshold.b <- 0.2
dy.b <- c(rep(NA, span.b), diff(y, lag = span.b))
span.f <- 10
threshold.f <- 0.05
dy.f <- c(diff(y, lag = span.f), rep(NA, span.f))
down <- which(dy.b < -1 * threshold.b & abs(dy.f) < threshold.f)
abline(v = down)
The plot shows that it's not perfect, but it doesn't discard the outliers (I guess it depends on your take on the data).

Resources