I am working with the R programming language.
I am trying to recreate the travelling salesman problem. The travelling salesman problem is a problem where a salesman has to visit "n" number of cities exactly once, in such a way that his total distance is minimized.
For this problem, I first created a dataset made of n = 6 cities (longitude, latitude):
set.seed(123)
data_1 = data.frame(id = c(1,2,3), long = rnorm(3, -74, 1 ), lat = rnorm(3, 40, 1 ))
data_2 = data.frame(id = c(4,5,6), long = rnorm(3, -78, 1 ), lat = rnorm(3, 42, 1 ))
final_data = rbind(data_1, data_2)
final_data
id long lat
1 1 -74.56048 40.07051
2 2 -74.23018 40.12929
3 3 -72.44129 41.71506
4 4 -77.53908 41.55434
5 5 -79.26506 43.22408
6 6 -78.68685 42.35981
For a given order of cities (e.g. 1,2,3,4,5,6), I created a function which determines the distance (based on the Euclidean Distance) between each successive pair of cities, and then calculates the total distance travelled:
distance <- function( long1, lat1, long2, lat2, long3, lat3, long4, lat4, long5, lat5, long6, lat6) {
d1_2 = sqrt( (long1 - lat1)^2 + (long2 - lat2)^2 )
d2_3 = sqrt( (long2 - lat2)^2 + (lat3 - long3)^2 )
d3_4 = sqrt( (long3 - lat3)^2 + (long4 - lat4)^2 )
d4_5 = sqrt( (long4 - lat4)^2 + (long5 - lat5)^2 )
d5_6 = sqrt( (long5 - lat5)^2 + (long6 - lat6)^2 )
return( d1_2 + d2_3 + d3_4 + d4_5 + d5_6 )
}
distance(final_data[1,2], final_data[1,3], final_data[2,2], final_data[2,3], final_data[3,2], final_data[3,3], final_data[4,2], final_data[4,3], final_data[5,2], final_data[5,3], final_data[6,2], final_data[6,3])
Then, I can randomize the order of the rows to obtain different routes and calculate the distance for each route:
#first route
rows <- sample(nrow(final_data))
route_1 <- final_data[rows, ]
> route_1
id long lat
1 1 -74.56048 40.07051
3 3 -72.44129 41.71506
4 4 -77.53908 41.55434
2 2 -74.23018 40.12929
6 6 -78.68685 42.35981
5 5 -79.26506 43.22408
distance(route_1[1,2], route_1[1,3], route_1[2,2], route_1[2,3], route_1[3,2], route_1[3,3], route_1[4,2], route_1[4,3], route_1[5,2], route_1[5,3], route_1[6,2], route_1[6,3])
[1] 830.5902
Next Route:
#second route
rows <- sample(nrow(final_data))
route_2 <- final_data[rows, ]
> route_2
id long lat
5 5 -79.26506 43.22408
4 4 -77.53908 41.55434
3 3 -72.44129 41.71506
2 2 -74.23018 40.12929
1 1 -74.56048 40.07051
6 6 -78.68685 42.35981
distance(route_2[1,2], route_2[1,3], route_2[2,2], route_2[2,3], route_2[3,2], route_2[3,3], route_2[4,2], route_2[4,3], route_2[5,2], route_2[5,3], route_2[6,2], route_2[6,3])
[1] 826.028
#etc
My Question: In the spirit of the Travelling Salesman Problem, I am trying to (ironically) show that what I am doing is extremely inefficient and will not work for more than 10 cities (i.e. take too long to run). In the case of 6 cities, can someone please show me how to calculate the distance for every possible route (6! = 720 routes) and calculate the time required to compute all these distances?
Here is what I know how to do so far:
Part 1: Generate All Possible Routes
library(combinat)
all_routes = permn(c(1,2,3,4,5,6))
> head(all_routes)
[[1]]
[1] 1 2 3 4 5 6
[[2]]
[1] 1 2 3 4 6 5
[[3]]
[1] 1 2 3 6 4 5
[[4]]
[1] 1 2 6 3 4 5
[[5]]
[1] 1 6 2 3 4 5
[[6]]
[1] 6 1 2 3 4 5
Part 2: Record the Time Required to Calculate a Single Route
start.time <- Sys.time()
distance(route_1[1,2], route_1[1,3], route_1[2,2], route_1[2,3], route_1[3,2], route_1[3,3], route_1[4,2], route_1[4,3], route_1[5,2], route_1[5,3], route_1[6,2], route_1[6,3])
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
Time difference of 0.003665924 secs
Can someone please show me how to put this all together?
Thanks!
The cumulative distance for all 6! routes over the given final_data can be calculated like this:
set.seed(123)
data_1 = data.frame(id = c(1,2,3), long = rnorm(3, -74, 1 ), lat = rnorm(3, 40, 1 ))
data_2 = data.frame(id = c(4,5,6), long = rnorm(3, -78, 1 ), lat = rnorm(3, 42, 1 ))
final_data = rbind(data_1, data_2)
N <- nrow(final_data) # just for repeated convenience
final_data
# id long lat
# 1 1 -74.56048 40.07051
# 2 2 -74.23018 40.12929
# 3 3 -72.44129 41.71506
# 4 4 -77.53908 41.55434
# 5 5 -79.26506 43.22408
# 6 6 -78.68685 42.35981
Calculate the distances between each city, pair-wise. I'm using distHaversine because you listed lat/lon, and part of me cringes seeing cartesian distance calcs applied to that :-)
dists <- outer(seq_len(N), seq_len(N), function(a,b) {
geosphere::distHaversine(final_data[a,2:3], final_data[b,2:3]) # Notes 1, 2
})
dists
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 0.00 28876.24 255554.4 300408.5 525566.9 429264.3
# [2,] 28876.24 0.00 231942.7 320616.0 541980.9 448013.6
# [3,] 255554.43 231942.67 0.0 424449.9 584761.5 521210.7
# [4,] 300408.47 320616.03 424449.9 0.0 233840.9 130640.9
# [5,] 525566.87 541980.93 584761.5 233840.9 0.0 107178.2
# [6,] 429264.34 448013.57 521210.7 130640.9 107178.2 0.0
(Units are in meters.)
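For readers without geosphere installed, the great-circle distance can be sketched in base R with the haversine formula. The helper name haversine_m and the Earth radius of 6378137 m are assumptions on my part (I believe that radius matches the geosphere default, but check the package documentation). Treat this as an illustration, not a replacement for distHaversine.

```r
# Hypothetical helper (not part of geosphere): haversine distance in meters.
# Inputs are in degrees; r is an assumed Earth radius in meters.
haversine_m <- function(long1, lat1, long2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlong <- (long2 - long1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlong / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * r * asin(pmin(1, sqrt(a)))
}

# One degree of longitude along the equator is r * pi / 180 meters.
one_degree <- haversine_m(0, 0, 1, 0)
one_degree
```

Because the function is vectorised over its arguments, it could also be passed to outer() over row indices in the same way as the distHaversine call above.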
Calculate the cumulative distance along each of the routes:
perms <- gtools::permutations(N, N)
nrow(perms)
# [1] 720
perms[c(1:4, 719:720),]
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 1 2 3 4 5 6
# [2,] 1 2 3 4 6 5
# [3,] 1 2 3 5 4 6
# [4,] 1 2 3 5 6 4
# [5,] 6 5 4 3 1 2
# [6,] 6 5 4 3 2 1
allroutes5 <- t(apply(perms, 1, function(route) {
dists[cbind(route[-N], route[-1])]
}))
head(allroutes5)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 28876.24 231942.7 424449.9 233840.9 107178.2
# [2,] 28876.24 231942.7 424449.9 130640.9 107178.2
# [3,] 28876.24 231942.7 584761.5 233840.9 130640.9
# [4,] 28876.24 231942.7 584761.5 107178.2 130640.9
# [5,] 28876.24 231942.7 521210.7 130640.9 233840.9
# [6,] 28876.24 231942.7 521210.7 107178.2 233840.9
allroutes_total <- rowSums(allroutes5)
head(allroutes_total)
# [1] 1026287.9 923087.9 1210062.2 1083399.4 1146511.4 1123048.7
As confirmation of this, the first row of allroutes5 is the sequence of cities 1, 2, 3, 4, 5, and 6. Recalling dists above, from 1-2 is 28876; 2-3 is 231942; 3-4 is 424449; etc. Sum these up, and we have the total distance traveled over all cities in that route. allroutes_total holds the distances for all 720 possible routings (permutations).
min(allroutes_total)
# [1] 799046.4
which.min(allroutes_total)
# [1] 266
perms[which.min(allroutes_total),]
# [1] 3 2 1 4 6 5
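The question also asked how long it takes to score every route. Here is a rough base-R sketch of timing the whole thing with system.time(). To keep it free of package dependencies I am assuming plain Euclidean distances from dist() instead of the haversine distances above, and all_perms is a hypothetical helper I wrote for illustration (gtools::permutations would do the same job).

```r
set.seed(123)
data_1 <- data.frame(id = 1:3, long = rnorm(3, -74, 1), lat = rnorm(3, 40, 1))
data_2 <- data.frame(id = 4:6, long = rnorm(3, -78, 1), lat = rnorm(3, 42, 1))
final_data <- rbind(data_1, data_2)
N <- nrow(final_data)

# Assumption: Euclidean distance in degrees, computed once for all pairs.
dists <- as.matrix(dist(final_data[, c("long", "lat")]))

# Hypothetical helper: all permutations of 1..n, one per row.
all_perms <- function(n) {
  if (n == 1) return(matrix(1L, 1, 1))
  prev <- all_perms(n - 1)
  do.call(rbind, lapply(seq_len(n), function(pos) {
    # Insert n at position pos of every permutation of 1..(n - 1).
    cbind(prev[, seq_len(pos - 1), drop = FALSE],
          n,
          prev[, seq_len(n - pos) + pos - 1, drop = FALSE],
          deparse.level = 0)
  }))
}

# Time the generation of all routes plus the distance lookup for each one.
timing <- system.time({
  perms <- all_perms(N)
  totals <- apply(perms, 1, function(route) {
    sum(dists[cbind(route[-N], route[-1])])
  })
})

nrow(perms)          # number of routes scored
timing[["elapsed"]]  # seconds; varies by machine
factorial(10)        # routes that would be needed for 10 cities
```

The point for the question is the growth rate: the permutation matrix has factorial(N) rows, so every added city multiplies both the memory and the run time.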
Notes:
Using your formula, I was able to duplicate your distances:
dists <- outer(seq_len(N), seq_len(N), function(a,b) {
sqrt((final_data[a,"long"] - final_data[a,"lat"])^2 + (final_data[b,"long"] - final_data[b,"lat"])^2)
})
dists
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 162.1127 161.9208 161.7774 165.2982 167.7613 166.7110
# [2,] 161.9208 161.7287 161.5852 165.1101 167.5759 166.5244
# [3,] 161.7774 161.5852 161.4415 164.9694 167.4373 166.3850
# [4,] 165.2982 165.1101 164.9694 168.4235 170.8415 169.8103
# [5,] 167.7613 167.5759 167.4373 170.8415 173.2258 172.2088
# [6,] 166.7110 166.5244 166.3850 169.8103 172.2088 171.1858
### first route
which(apply(perms, 1, identical, c(1L, 3L, 4L, 2L, 6L, 5L)))
# [1] 32
allroutes_total[32]
# [1] 830.5902
### second route
which(apply(perms, 1, identical, c(5L, 4L, 3L, 2L, 1L, 6L)))
# [1] 567
allroutes_total[567]
# [1] 826.028
And if you're curious, your second route was tied for fifth-shortest:
min(allroutes_total)
# [1] 826.0252
which.min(allroutes_total)
# [1] 561
perms[which.min(allroutes_total),]
# [1] 5 4 2 3 1 6
rank(allroutes_total)[567]
# [1] 5.5
I'm not sure that's the right distance calculation, though. I'd think the Euclidean distance should be:
dists <- outer(seq_len(N), seq_len(N), function(a,b) {
sqrt((final_data[a,"long"] - final_data[b,"long"])^2 + (final_data[a,"lat"] - final_data[b,"lat"])^2)
})
dists
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 0.0000000 0.3354875 2.682444 3.327741 5.663758 4.718888
# [2,] 0.3354875 0.0000000 2.390565 3.602725 5.909975 4.983694
# [3,] 2.6824442 2.3905652 0.000000 5.100325 6.988631 6.278753
# [4,] 3.3277405 3.6027253 5.100325 0.000000 2.401467 1.402200
# [5,] 5.6637577 5.9099750 6.988631 2.401467 0.000000 1.039848
# [6,] 4.7188885 4.9836936 6.278753 1.402200 1.039848 0.000000
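For scoring a single route with that corrected Euclidean formula, a minimal sketch could use diff() on the coordinates in visiting order. The function name route_length is hypothetical, and it assumes a data frame with long and lat columns whose rows are already in the order visited.

```r
# Hypothetical helper: total Euclidean length of a route.
# `route` is a data frame with columns long and lat, rows in visiting order.
route_length <- function(route) {
  # diff() gives the coordinate change between each successive pair of cities
  sum(sqrt(diff(route$long)^2 + diff(route$lat)^2))
}

# Small hand-checkable example: (0,0) -> (3,4) -> (3,0) is 5 + 4.
toy <- data.frame(long = c(0, 3, 3), lat = c(0, 4, 0))
route_length(toy)
```

This works for any number of cities, so it avoids the twelve-argument signature of the original distance() function.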
Does anyone know another method for filtering data when the same ID (column x) appears twice in a data frame but with a different associated value (column y)?
Basically, I want to know which rows are in both data frames, and then which row is not in both (actually, I want the x and y values of that particular row).
Thank you in advance for your help!
> x <- seq(1:10)
> x[5] <- 4
> y <- (seq.int(1,19,2))
>
> x<- cbind(x,y)
> x
x y
[1,] 1 1
[2,] 2 3
[3,] 3 5
[4,] 4 7
[5,] 4 9
[6,] 6 11
[7,] 7 13
[8,] 8 15
[9,] 9 17
[10,] 10 19
>
> z <- x[1:4,]
> y <- x[6:10,]
>
> z <- rbind(z,y)
> z
x y
[1,] 1 1
[2,] 2 3
[3,] 3 5
[4,] 4 7
[5,] 6 11
[6,] 7 13
[7,] 8 15
[8,] 9 17
[9,] 10 19
>
> df1 <- z[z[,1] %in% x[,1]]
>
> matrix(df1,9,2) # As expected I'm getting 9 rows
[,1] [,2]
[1,] 1 1
[2,] 2 3
[3,] 3 5
[4,] 4 7
[5,] 6 11
[6,] 7 13
[7,] 8 15
[8,] 9 17
[9,] 10 19
>
> # Now I want to know what is the value inside the missing row
> df2 <- z[!z[,1] %in% x[,1]]
>
> matrix(df2,1,2) # I'm getting NA and NA, but I was expecting the values 4 and 9
[,1] [,2]
[1,] NA NA
To use @hansjaneinvielleicht's method:
xlist <- paste(x[,1], x[,2])
zlist <- paste(z[,1], z[,2])
setdiff(xlist, zlist)
# [1] "4 9"
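The pasted keys can also be used to index back into x, so the missing row comes out as a matrix row instead of a string. This is only a sketch: row_key is a hypothetical helper, and it assumes that pasting all columns together identifies a row uniquely (true for the numeric example here, but worth checking on real data).

```r
x <- cbind(x = c(1, 2, 3, 4, 4, 6, 7, 8, 9, 10), y = seq(1, 19, 2))
z <- x[-5, ]  # same as x, minus the row (4, 9)

# Hypothetical helper: one string key per row, built from every column.
row_key <- function(m) do.call(paste, c(as.data.frame(m), sep = "\r"))

in_both   <- x[row_key(x) %in% row_key(z), , drop = FALSE]
only_in_x <- x[!row_key(x) %in% row_key(z), , drop = FALSE]
only_in_x
```

Because the key is built from every column, the same call should carry over to matrices with more than two columns.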
What you're doing here is to filter for values that are not present in x[,1]. However, since 4 is in there, it's also filtered out.
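Instead, I assume you'd probably want to work with the setdiff function from dplyr (see its documentation), which compares whole rows when it is given data frames.
Then, with x and z converted to data frames, use df2 <- setdiff(x, z)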
I am using a cumulative count here to add another key that distinguishes the duplicated values in x[,1]:
v=ave(x[,1]==x[,1], x[,1], FUN=cumsum)
t=ave(z[,1]==z[,1], z[,1], FUN=cumsum)
df2 <- x[!paste(x[,1],v) %in% paste(z[,1],t)]
matrix(df2,1,2)
[,1] [,2]
[1,] 4 9
x <- data.frame(x)
z <- data.frame(z)
x$from <- "x"
z$from <- "z"
df2 <- merge(x, z, by = c("x", "y"), all.x = TRUE)
df2
# x y from.x from.y
# 1 1 1 x z
# 2 2 3 x z
# 3 3 5 x z
# 4 4 7 x z
# 5 4 9 x <NA>
# 6 6 11 x z
# 7 7 13 x z
# 8 8 15 x z
# 9 9 17 x z
# 10 10 19 x z
df2 <- df2[is.na(df2$from.y),]
df2
# x y from.x from.y
# 5 4 9 x <NA>
My real problem was not the one posted, since it was too complicated.
Basically, I was not able to apply any of the solutions to my real problem, because my real data frames contained all kinds of data types and had a lot of columns.
I did find a solution that works for my real problem as well as for the problem posted in the question, so I am posting it in case it is useful for someone!
> dup <- which(duplicated(x[,1]) == TRUE)
> ans <- matrix(x[dup,],1,2)
> ans
[,1] [,2]
[1,] 4 9
> # I'm doing this in case the answer was not NA in df2 at the previous step, without
# providing the row "missing"
> df2 <- rbind(df2, ans)
> df2
[,1] [,2]
[1,] 4 9
How can I extract every two elements in sequence from a matrix and return the result as a matrix, so that I can feed it into a formula for calculation?
For example, I have a one row matrix with 6 columns:
[,1][,2][,3][,4][,5][,6]
[1,] 2 1 5 5 10 1
I want to extract columns 1 and 2 in the first iteration, 3 and 4 in the second iteration, and so on. The result has to be in the form of a matrix.
[1,] 2 1
[2,] 5 5
[3,] 10 1
My original code:
data <- matrix(c(1,1,1,2,2,1,2,2,5,5,5,6,10,1,10,2,11,1,11,2), ncol = 2)
Center Matrix:
[,1][,2][,3][,4][,5][,6]
[1,] 2 1 5 5 10 1
[2,] 1 1 2 1 10 1
[3,] 5 5 5 6 11 2
[4,] 2 2 5 5 10 1
[5,] 2 1 5 6 5 5
[6,] 2 2 5 5 11 1
[7,] 2 1 5 5 10 1
[8,] 1 1 5 6 11 1
[9,] 2 1 5 5 10 1
[10,] 5 6 11 1 10 2
objCentroidDist <- function(data, centers) {
resultMatrix <- matrix(NA, nrow=dim(data)[1], ncol=dim(centers)[1])
for(i in 1:nrow(centers)) {
resultMatrix [,i] <- sqrt(rowSums(t(t(data)-centers[i, ])^2))
}
resultMatrix
}
objCentroidDist(data,centers)
I want the Result matrix to be as per below:
[,1][,2][,3]
[1,]
[2,]
[3,]
[4,]
[5,]
[6,]
[7,]
[8,]
[9,]
[10,]
My concern is how to calculate the distance between data and centers when the data matrix has two columns and the centers matrix has six, i.e. the distance between the data matrix and every two columns of the centers matrix. Each row of the centers matrix holds three centers.
Something like this maybe?
m <- matrix(c(2,1,5,5,10,1), ncol = 6)
list.seq.pairs <- lapply(seq(1, ncol(m), 2), function(x) {
m[,c(x, x+1)]
})
> list.seq.pairs
[[1]]
[1] 2 1
[[2]]
[1] 5 5
[[3]]
[1] 10 1
And, in case you're wanting to iterate over multiple rows in a matrix, you can expand on the above like this:
mm <- matrix(1:18, ncol = 6, byrow = TRUE)
apply(mm, 1, function(x) {
lapply(seq(1, length(x), 2), function(y) {
x[c(y, y+1)]
})
})
EDIT:
I'm really not sure what you're after exactly. I think, if you want each row transformed into a 2 x 3 matrix:
mm <- matrix(1:18, ncol = 6, byrow = TRUE)
list.mats <- lapply(1:nrow(mm), function(x){
a = matrix(mm[x,], ncol = 2, byrow = TRUE)
})
> list.mats
[[1]]
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
[[2]]
[,1] [,2]
[1,] 7 8
[2,] 9 10
[3,] 11 12
[[3]]
[,1] [,2]
[1,] 13 14
[2,] 15 16
[3,] 17 18
If, however, you want to get to your results matrix, I think it's probably easiest to do whatever calculations you need to do while you're dealing with each row:
results <- t(apply(mm, 1, function(x) {
sapply(seq(1, length(x), 2), function(y) {
val1 = x[y] # Get item one
val2 = x[y+1] # Get item two
val1 / val2 # Do your calculation here
})
}))
> results
[,1] [,2] [,3]
[1,] 0.5000000 0.7500 0.8333333
[2,] 0.8750000 0.9000 0.9166667
[3,] 0.9285714 0.9375 0.9444444
That said, I don't understand what you're trying to do so this may miss the mark. You may have more luck if you ask a new question where you show example input and the actual expected output that you're after, with the actual values you expect.
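If the goal is the distance from every data point to each of the three centers stored in one row of the centers matrix, one possible reading can be sketched as follows. I'm assuming that data holds one (x, y) point per row (hence byrow = TRUE, which the original call does not use) and that each row of centers is three (x, y) pairs laid end to end. Both are guesses about the intent.

```r
# Assumption: one (x, y) point per row, so byrow = TRUE is used here.
data <- matrix(c(1,1, 1,2, 2,1, 2,2, 5,5, 5,6, 10,1, 10,2, 11,1, 11,2),
               ncol = 2, byrow = TRUE)
centers_row <- c(2, 1, 5, 5, 10, 1)  # first row of the centers matrix

# Reshape the six values into a 3 x 2 matrix: one center per row.
centers <- matrix(centers_row, ncol = 2, byrow = TRUE)

# Distance from every data point (rows) to every center (columns).
objCentroidDist <- function(data, centers) {
  sapply(seq_len(nrow(centers)), function(i) {
    # sweep() subtracts center i from every row of data
    sqrt(rowSums(sweep(data, 2, centers[i, ])^2))
  })
}

result <- objCentroidDist(data, centers)
dim(result)  # one row per data point, one column per center
```

To process all ten rows of the centers matrix, the reshape-and-call step could be wrapped in lapply() over its rows, giving one 10 x 3 result per row.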
I have a matrix
mat_a <- matrix(data = c( c(rep(1,3), rep(2,3), rep(3,3))
, rep(seq(1,300,100), 3)
, runif(9, 0, 1))
, ncol=3)
[,1] [,2] [,3]
[1,] 1 1 0.8393401
[2,] 1 101 0.5486805
[3,] 1 201 0.4449259
[4,] 2 1 0.3949137
[5,] 2 101 0.4002575
[6,] 2 201 0.3288861
[7,] 3 1 0.7865035
[8,] 3 101 0.2581155
[9,] 3 201 0.8987769
that I compare to another matrix with higher dimensions
mat_b <- matrix(data = c(
c(rep(1,3), rep(2,3), rep(3,3), rep(4,3))
, rep(seq(1,300,100), 4)
, rep(3:5, 4))
, ncol = 3)
[,1] [,2] [,3]
[1,] 1 1 3
[2,] 1 101 4
[3,] 1 201 5
[4,] 2 1 3
[5,] 2 101 4
[6,] 2 201 5
[7,] 3 1 3
[8,] 3 101 4
[9,] 3 201 5
[10,] 4 1 3
[11,] 4 101 4
[12,] 4 201 5
I need to extract the rows of mat_a whose column 2 matches column 2 of mat_b. For those matches, column 1 of both matrices must also match, and column 3 of mat_b must be greater than or equal to 4.
I cannot find any solution based on vectorization. I only came out with a loop-based solution.
output <- NULL
for (i in 1:nrow(mat_a)) {
if (mat_a[i,2] %in% mat_b[,2][mat_b[,3] >= 4]) {
rows <- which( mat_b[,2] %in% mat_a[i,2])
row <- which(mat_b[,1][rows] == mat_a[i,1])
if (mat_b[,3][rows[row]] >= 4) {
output <- rbind(output, mat_a[i,])
}
}
}
This works but is extremely slow. It took less than one hour to run. mat_a has 9 columns and 40,000 rows (and could grow), while mat_b has 5 columns and around 1.2 million rows.
Any idea?
It is better to work with data frames when comparing tables like this. That plays to R's strengths instead of working against them. First subset b with the necessary condition, b$V3 >= 4, then use a simple merge to match the rows on V1 and V2. At the end, [-4] drops the fourth column (b's V3) so that the output more closely matches your desired output:
a <- as.data.frame(mat_a)
b <- as.data.frame(mat_b)
merge(a,b[b$V3 >= 4,], by=c("V1","V2"))[-4]
# V1 V2 V3.x
# 1 1 101 0.1118960
# 2 1 201 0.1543351
# 3 2 101 0.3950491
# 4 2 201 0.5688684
# 5 3 201 0.4749941
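If you would rather stay with matrices, the same filter can be sketched with pasted keys and %in%, which is also vectorised. This assumes that columns 1 and 2 together identify a row and that their values paste to the same text in both matrices (fine for the whole numbers in this example). The seed is only there to make the sketch reproducible.

```r
set.seed(1)  # arbitrary seed, just for a reproducible example
mat_a <- cbind(rep(1:3, each = 3), rep(seq(1, 300, 100), 3), runif(9))
mat_b <- cbind(rep(1:4, each = 3), rep(seq(1, 300, 100), 4), rep(3:5, 4))

# Keep only the mat_b rows that satisfy the condition on column 3.
keep_b <- mat_b[mat_b[, 3] >= 4, , drop = FALSE]

# Build one key per row from columns 1 and 2, then match the keys.
key_a <- paste(mat_a[, 1], mat_a[, 2])
key_b <- paste(keep_b[, 1], keep_b[, 2])
output <- mat_a[key_a %in% key_b, , drop = FALSE]
output
```

I have not benchmarked this against merge on the 1.2 million row case, so time both on your real data before choosing.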
I am looking for a more versatile way to get from a data.frame to a multidimensional array.
I would like to be able to create as many dimensions as needed from as many variables in the data frame as desired.
Currently, the method has to be tailored to each data.frame and requires subsetting to form a vector.
I would love something along the melt/cast methods in plyr.
data<-data.frame(coord.name=rep(1:10, 2),
x=rnorm(20),
y=rnorm(20),
ID=rep(c("A","B"), each=10))
data.array<-array(dim=c(10, 2, length(unique(data$ID))))
for(i in 1:length(unique(data$ID))){
data.array[,1,i]<-data[data$ID==unique(data$ID)[i],"x"]
data.array[,2,i]<-data[data$ID==unique(data$ID)[i],"y"]
}
data.array
, , 1
[,1] [,2]
[1,] 1 1
[2,] 3 3
[3,] 5 5
[4,] 7 7
[5,] 9 9
[6,] 1 1
[7,] 3 3
[8,] 5 5
[9,] 7 7
[10,] 9 9
, , 2
[,1] [,2]
[1,] 2 2
[2,] 4 4
[3,] 6 6
[4,] 8 8
[5,] 10 10
[6,] 2 2
[7,] 4 4
[8,] 6 6
[9,] 8 8
[10,] 10 10
You may have had trouble applying the reshape2 functions for a somewhat subtle reason. The difficulty was that your data.frame has no column that can be used to direct how you want to arrange the elements along the first dimension of an output array.
Below, I explicitly add such a column, calling it "row". With it in place, you can use the expressive acast() or dcast() functions to reshape the data in any way you choose.
library(reshape2)
# Use this or some other method to add a column of row indices.
data$row <- with(data, ave(ID==ID, ID, FUN = cumsum))
m <- melt(data, id.vars = c("row", "ID"))
a <- acast(m, row ~ variable ~ ID)
a[1:3, , ]
# , , A
#
# x y
# 1 1 1
# 2 3 3
# 3 5 5
#
# , , B
#
# x y
# 1 2 2
# 2 4 4
# 3 6 6
I think this is right:
array(unlist(lapply(split(data, data$ID), function(x) as.matrix(x[ , c("x", "y")]))), c(10, 2, 2))
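A variation on the same idea that avoids hard-coding the dimensions is sketched below. It relies on simplify2array() stacking equally sized matrices into a 3-d array, and assumes every ID has the same number of rows and that the chosen columns are numeric. The column selection c("x", "y") is the only part that needs changing for other variables.

```r
set.seed(42)  # arbitrary seed, just for a reproducible example
data <- data.frame(coord.name = rep(1:10, 2),
                   x = rnorm(20),
                   y = rnorm(20),
                   ID = rep(c("A", "B"), each = 10))

# Split the value columns by ID, turn each piece into a matrix,
# and stack the matrices along a third dimension named after the IDs.
pieces <- lapply(split(data[, c("x", "y")], data$ID), as.matrix)
arr <- simplify2array(pieces)

dim(arr)            # rows per ID, number of variables, number of IDs
arr[1:3, "x", "B"]  # first three x values for ID "B"
```

The dimensions are taken from the data itself, so nothing needs to be edited when the number of IDs or rows changes.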