I want to code travelling salesman problem in R. I am going to begin with 3 cities at first then I will expand to more cities. distance matrix below gives distance between 3 cities. Objective (if someone doesn't know) is that a salesman will start from a city and will visit 2 other cities such that he has to travel minimum distance.
In below case he should start either from ny or LA and then travel to chicago and then to the remaining city. I need help to define A_ (my constraint matrix).
My decision variables will of same dimension as distances matrix. It will be a 1,0 matrix where 1 represents travel from city equal to row name to a city equal to column name. For instance if a salesman travels from ny to chicago, 2nd element in row 1 will be 1. My column and row names are ny,chicago and LA
By looking at the solution of the problem I concluded that my constraints will be::
Row sums have to be less than 1 as he cannot leave from same city twice
Column sums have to be less than 1 as he cannot enter the same city twice
total sum of matrix elements has to be 2 as the salesman will be visiting 2 cities and leaving from 2 cities.
I need help to define A_ (my constraint matrix). How should I tie in my decision variables into constraints?
ny=c(999,9,20)
chicago=c(9,999,11)
LA=c(20,11,999)
distances=cbind(ny,chicago,LA)
dv=matrix(c("a11","a12","a13","a21","a22","a23","a31","a32","a33"),nrow=3,ncol=3)
c_=c(distances[1,],distances[2,],distances[3,])
signs = c((rep('<=', 7)))
b=c(1,1,1,1,1,1,2)
res = lpSolve::lp('min', c_, A_, signs, b, all.bin = TRUE)
There are some problems with your solution. The first is that the constraints you have in mind don't guarantee that all the cities will be visited -- for example, the path could just go from NY to LA and then back. This could be solved fairly easily, for example, by requiring that each row and column sum to exactly one rather than at most 1 (although in that case you'd be finding a traveling salesman tour rather than just a path).
The bigger problem is that, even if we fix this problem, your constraints wouldn't guarantee that the selected vertices actually form one cycle through the graph, rather than multiple smaller cycles. And I don't think that your representation of the problem can be made to address this issue.
Here is an implementation of Travelling Salesman using LP. The solution space is of size n^3, where n is the number of rows in the distance matrix. This represents n consecutive copies of the nxn matrix, each of which represents the edge traversed at time t for 1<=t<=n. The constraints guarantee that
At most one edge is traversed each step
Ever vertex is visited exactly once
The startpoint of the i'th edge traversed is the same as the endpoint of the i-1'st
This avoids the problem of multiple small cycles. For example, with four vertices, the sequence (12)(21)(34)(43) would not be a valid solution because the endpoint of the second edge (21) does not match the start point of the third (34).
tspsolve<-function(x){
diag(x)<-1e10
## define some basic constants
nx<-nrow(x)
lx<-length(x)
objective<-matrix(x,lx,nx)
rowNum<-rep(row(x),nx)
colNum<-rep(col(x),nx)
stepNum<-rep(1:nx,each=lx)
## these constraints ensure that at most one edge is traversed each step
onePerStep.con<-do.call(cbind,lapply(1:nx,function(i) 1*(stepNum==i)))
onePerRow.rhs<-rep(1,nx)
## these constraints ensure that each vertex is visited exactly once
onceEach.con<-do.call(cbind,lapply(1:nx,function(i) 1*(rowNum==i)))
onceEach.rhs<-rep(1,nx)
## these constraints ensure that the start point of the i'th edge
## is equal to the endpoint of the (i-1)'st edge
edge.con<-c()
for(s in 1:nx){
s1<-(s %% nx)+1
stepMask<-(stepNum == s)*1
nextStepMask<- -(stepNum== s1)
for(i in 1:nx){
edge.con<-cbind(edge.con,stepMask * (colNum==i) + nextStepMask*(rowNum==i))
}
}
edge.rhs<-rep(0,ncol(edge.con))
## now bind all the constraints together, along with right-hand sides, and signs
constraints<-cbind(onePerStep.con,onceEach.con,edge.con)
rhs<-c(onePerRow.rhs,onceEach.rhs,edge.rhs)
signs<-rep("==",length(rhs))
list(constraints,rhs)
## call the lp solver
res<-lp("min",objective,constraints,signs,rhs,transpose=F,all.bin=T)
## print the output of lp
print(res)
## return the results as a sequence of vertices, and the score = total cycle length
list(cycle=colNum[res$solution==1],score=res$objval)
}
Here is an example:
set.seed(123)
x<-matrix(runif(16),c(4,4))
x
## [,1] [,2] [,3] [,4]
## [1,] 0.2875775 0.9404673 0.5514350 0.6775706
## [2,] 0.7883051 0.0455565 0.4566147 0.5726334
## [3,] 0.4089769 0.5281055 0.9568333 0.1029247
## [4,] 0.8830174 0.8924190 0.4533342 0.8998250
tspsolve(x)
## Success: the objective function is 2.335084
## $cycle
## [1] 1 3 4 2
##
## $score
## [1] 2.335084
We can check the correctness of this answer by using a primitive brute force search:
tspscore<-function(x,solution){
sum(sapply(1:nrow(x), function(i) x[solution[i],solution[(i%%nrow(x))+1]]))
}
tspbrute<-function(x,trials){
score<-Inf
cycle<-c()
nx<-nrow(x)
for(i in 1:trials){
temp<-sample(nx)
tempscore<-tspscore(x,temp)
if(tempscore<score){
score<-tempscore
cycle<-temp
}
}
list(cycle=cycle,score=score)
}
tspbrute(x,100)
## $cycle
## [1] 3 4 2 1
##
## $score
## [1] 2.335084
Note that, even though these answers are nominally different, they represent the same cycle.
For larger graphs, though, the brute force approach doesn't work:
> set.seed(123)
> x<-matrix(runif(100),10,10)
> tspsolve(x)
Success: the objective function is 1.296656
$cycle
[1] 1 10 3 9 5 4 8 2 7 6
$score
[1] 1.296656
> tspbrute(x,1000)
$cycle
[1] 1 5 4 8 10 9 2 7 6 3
$score
[1] 2.104487
This implementation is pretty efficient for small matrices, but, as expected, it starts to deteriorate severely as they get larger. At about 15x15 it starts slowing down quite a bit:
timetsp<-function(x,seed=123){
set.seed(seed)
m<-matrix(runif(x*x),x,x)
gc()
system.time(tspsolve(m))[3]
}
sapply(6:16,timetsp)
## elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed
## 0.011 0.010 0.018 0.153 0.058 0.252 0.984 0.404 1.984 20.003
## elapsed
## 5.565
You can use the gaoptim package to solve permutation/real valued problems - it's pure R, so it's not so fast:
Euro tour problem (see ?optim)
eurodistmat = as.matrix(eurodist)
# Fitness function (we'll perform a maximization, so invert it)
distance = function(sq)
{
sq = c(sq, sq[1])
sq2 <- embed(sq, 2)
1/sum(eurodistmat[cbind(sq2[,2], sq2[,1])])
}
loc = -cmdscale(eurodist, add = TRUE)$points
x = loc[, 1]
y = loc[, 2]
n = nrow(eurodistmat)
set.seed(1)
# solving code
require(gaoptim)
ga2 = GAPerm(distance, n, popSize = 100, mutRate = 0.3)
ga2$evolve(200)
best = ga2$bestIndividual()
# solving code
# just transform and plot the results
best = c(best, best[1])
best.dist = 1/max(ga2$bestFit())
res = loc[best, ]
i = 1:n
plot(x, y, type = 'n', axes = FALSE, ylab = '', xlab = '')
title ('Euro tour: TSP with 21 cities')
mtext(paste('Best distance found:', best.dist))
arrows(res[i, 1], res[i, 2], res[i + 1, 1], res[i + 1, 2], col = 'red', angle = 10)
text(x, y, labels(eurodist), cex = 0.8, col = 'gray20')
Related
With an igraph object I would like to capture some features of each node's neighbours, for example the average degree of its neighbours.
I come up with this code, which is inelegant and quite slow.
How should I rethink it for large and complex networks?
library(igraph)
# Toy example
set.seed(123)
g <- erdos.renyi.game(10,0.2)
# Loop to calculate average degree of each node's neighbourhood
s <- character(0)
for(i in 1:gorder(g)){
n <- ego_size(g, nodes = i, order = 1, mindist = 1)
node_of_interest <- unique(unlist(ego(g, nodes = i, order = 1, mindist = 1)))
m <- mean(degree(g, v = node_of_interest, loops = TRUE, normalized = FALSE)-1)
s <- rbind(s,data.frame(node = i, neighbours = n, mean = m))
}
Expanding the data structure with rbind in a loop can get quite slow in R, because at every step it needs to allocate the space for the new object, and then copy it (see section 24.6 here). Also, you might be computing the degree of a node many times, if it s the neighbor of multiple nodes.
A possibly better alternative could be:
# add vertex id (not really necessary)
V(g)$name <- V(g)
# add degree to the graph
V(g)$degree <- degree(g, loops = TRUE, normalized = FALSE)
# get a list of neighbours, for each node
g_ngh <- neighborhood(g, mindist = 1)
# write a function that gets the means
get.mean <- function(x){
mean(V(g)$degree[x]-1)
}
# apply the function, add result to the graph
V(g)$av_degr_nei <- sapply(g_ngh, get.mean)
# get data into dataframe, if necessary
d_vert_attr <- as_data_frame(g, what = "vertices")
d_vert_attr
name degree av_degr_nei
1 1 0 NaN
2 2 1 2.0000000
3 3 2 1.0000000
4 4 1 1.0000000
5 5 2 1.0000000
6 6 1 1.0000000
7 7 3 0.6666667
8 8 1 0.0000000
9 9 1 0.0000000
10 10 0 NaN
I have to calculate cosine similarity (patient similarity metric) in R between 48k patients data with some predictive variables. Here is the equation: PSM(P1,P2) = P1.P2/ ||P1|| ||P2||
where P1 and P2 are the predictor vectors corresponding to two different patients, where for example P1 index patient and P2 will be compared with index (P1) and finally pairwise patient similarity metric PSM(P1,P2) will be calculated.
This process will go on for all 48k patients.
I have added sample data-set for 300 patients in a .csv file. Please find the sample data-set here.https://1drv.ms/u/s!AhoddsPPvdj3hVTSbosv2KcPIx5a
First things first: You can find more rigorous treatments of cosine similarity at either of these posts:
Find cosine similarity between two arrays
Creating co-occurrence matrix
Now, you clearly have a mixture of data types in your input, at least
decimal
integer
categorical
I suspect that some of the integer values are Booleans or additional categoricals. Generally, it will be up to you to transform these into continuous numerical vectors if you want to use them as input into the similarity calculation. For example, what's the distance between admission types ELECTIVE and EMERGENCY? Is it a nominal or ordinal variable? I will only be modelling the columns that I trust to be numerical dependent variables.
Also, what have you done to ensure that some of your columns don't correlate with others? Using just a little awareness of data science and biomedical terminology, it seems likely that the following are all correlated:
diasbp_max, diasbp_min, meanbp_max, meanbp_min, sysbp_max and sysbp_min
I suggest going to a print shop and ordering a poster-size printout of psm_pairs.pdf. :-) Your eyes are better at detecting meaningful (but non-linear) dependencies between variable. Including multiple measurements of the same fundamental phenomenon may over-weight that phenomenon in your similarity calculation. Don't forget that you can derive variables like
diasbp_rage <- diasbp_max - diasbp_min
Now, I'm not especially good at linear algebra, so I'm importing a cosine similarity function form the lsa text analysis package. I'd love to see you write out the formula in your question as an R function. I would write it to compare one row to another, and use two nested apply loops to get all comparisons. Hopefully we'll get the same results!
After calculating the similarity, I try to find two different patients with the most dissimilar encounters.
Since you're working with a number of rows that's relatively large, you'll want to compare various algorithmic methodologies for efficiency. In addition, you could use SparkR/some other Hadoop solution on a cluster, or the parallel package on a single computer with multiple cores and lots of RAM. I have no idea whether the solution I provided is thread-safe.
Come to think of it, the transposition alone (as I implemented it) is likely to be computationally costly for a set of 1 million patient-encounters. Overall, (If I remember my computational complexity correctly) as the number of rows in your input increases, the performance could degrade exponentially.
library(lsa)
library(reshape2)
psm_sample <- read.csv("psm_sample.csv")
row.names(psm_sample) <-
make.names(paste0("patid.", as.character(psm_sample$subject_id)), unique = TRUE)
temp <- sapply(psm_sample, class)
temp <- cbind.data.frame(names(temp), as.character(temp))
names(temp) <- c("variable", "possible.type")
numeric.cols <- (temp$possible.type %in% c("factor", "integer") &
(!(grepl(
pattern = "_id$", x = temp$variable
))) &
(!(
grepl(pattern = "_code$", x = temp$variable)
)) &
(!(
grepl(pattern = "_type$", x = temp$variable)
))) | temp$possible.type == "numeric"
psm_numerics <- psm_sample[, numeric.cols]
row.names(psm_numerics) <- row.names(psm_sample)
psm_numerics$gender <- as.integer(psm_numerics$gender)
psm_scaled <- scale(psm_numerics)
pair.these.up <- psm_scaled
# checking for independence of variables
# if the following PDF pair plot is too big for your computer to open,
# try pair-plotting some random subset of columns
# keep.frac <- 0.5
# keep.flag <- runif(ncol(psm_scaled)) < keep.frac
# pair.these.up <- psm_scaled[, keep.flag]
# pdf device sizes are in inches
dev <-
pdf(
file = "psm_pairs.pdf",
width = 50,
height = 50,
paper = "special"
)
pairs(pair.these.up)
dev.off()
#transpose the dataframe to get the
#similarity between patients
cs <- lsa::cosine(t(psm_scaled))
# this is super inefficnet, because cs contains
# two identical triangular matrices
cs.melt <- melt(cs)
cs.melt <- as.data.frame(cs.melt)
names(cs.melt) <- c("enc.A", "enc.B", "similarity")
extract.pat <- function(enc.col) {
my.patients <-
sapply(enc.col, function(one.pat) {
temp <- (strsplit(as.character(one.pat), ".", fixed = TRUE))
return(temp[[1]][[2]])
})
return(my.patients)
}
cs.melt$pat.A <- extract.pat(cs.melt$enc.A)
cs.melt$pat.B <- extract.pat(cs.melt$enc.B)
same.pat <- cs.melt[cs.melt$pat.A == cs.melt$pat.B ,]
different.pat <- cs.melt[cs.melt$pat.A != cs.melt$pat.B ,]
most.dissimilar <-
different.pat[which.min(different.pat$similarity),]
dissimilar.pat.frame <- rbind(psm_numerics[rownames(psm_numerics) ==
as.character(most.dissimilar$enc.A) ,],
psm_numerics[rownames(psm_numerics) ==
as.character(most.dissimilar$enc.B) ,])
print(t(dissimilar.pat.frame))
which gives
patid.68.49 patid.9
gender 1.00000 2.00000
age 41.85000 41.79000
sysbp_min 72.00000 106.00000
sysbp_max 95.00000 217.00000
diasbp_min 42.00000 53.00000
diasbp_max 61.00000 107.00000
meanbp_min 52.00000 67.00000
meanbp_max 72.00000 132.00000
resprate_min 20.00000 14.00000
resprate_max 35.00000 19.00000
tempc_min 36.00000 35.50000
tempc_max 37.55555 37.88889
spo2_min 90.00000 95.00000
spo2_max 100.00000 100.00000
bicarbonate_min 22.00000 26.00000
bicarbonate_max 22.00000 30.00000
creatinine_min 2.50000 1.20000
creatinine_max 2.50000 1.40000
glucose_min 82.00000 129.00000
glucose_max 82.00000 178.00000
hematocrit_min 28.10000 37.40000
hematocrit_max 28.10000 45.20000
potassium_min 5.50000 2.80000
potassium_max 5.50000 3.00000
sodium_min 138.00000 136.00000
sodium_max 138.00000 140.00000
bun_min 28.00000 16.00000
bun_max 28.00000 17.00000
wbc_min 2.50000 7.50000
wbc_max 2.50000 13.70000
mingcs 15.00000 15.00000
gcsmotor 6.00000 5.00000
gcsverbal 5.00000 0.00000
gcseyes 4.00000 1.00000
endotrachflag 0.00000 1.00000
urineoutput 1674.00000 887.00000
vasopressor 0.00000 0.00000
vent 0.00000 1.00000
los_hospital 19.09310 4.88130
los_icu 3.53680 5.32310
sofa 3.00000 5.00000
saps 17.00000 18.00000
posthospmort30day 1.00000 0.00000
Usually I wouldn't add a second answer, but that might be the best solution here. Don't worry about voting on it.
Here's the same algorithm as in my first answer, applied to the iris data set. Each row contains four spatial measurements of the flowers form three different varieties of iris plants.
Below that you will find the iris analysis, written out as nested loops so you can see the equivalence. But that's not recommended for production with large data sets.
Please familiarize yourself with starting data and all of the intermediate dataframes:
The input iris data
psm_scaled (the spatial measurements, scaled to mean=0, SD=1)
cs (the matrix of pairwise similarities)
cs.melt (the pairwise similarities in long format)
At the end I have aggregated the mean similarities for all comparisons between one variety and another. You will see that comparisons between individuals of the same variety have mean similarities approaching 1, and comparisons between individuals of the same variety have mean similarities approaching negative 1.
library(lsa)
library(reshape2)
temp <- iris[, 1:4]
iris.names <- paste0(iris$Species, '.', rownames(iris))
psm_scaled <- scale(temp)
rownames(psm_scaled) <- iris.names
cs <- lsa::cosine(t(psm_scaled))
# this is super inefficient, because cs contains
# two identical triangular matrices
cs.melt <- melt(cs)
cs.melt <- as.data.frame(cs.melt)
names(cs.melt) <- c("enc.A", "enc.B", "similarity")
names(cs.melt) <- c("flower.A", "flower.B", "similarity")
class.A <-
strsplit(as.character(cs.melt$flower.A), '.', fixed = TRUE)
cs.melt$class.A <- sapply(class.A, function(one.split) {
return(one.split[1])
})
class.B <-
strsplit(as.character(cs.melt$flower.B), '.', fixed = TRUE)
cs.melt$class.B <- sapply(class.B, function(one.split) {
return(one.split[1])
})
cs.melt$comparison <-
paste0(cs.melt$class.A , '_vs_', cs.melt$class.B)
cs.agg <-
aggregate(cs.melt$similarity, by = list(cs.melt$comparison), mean)
print(cs.agg[order(cs.agg$x),])
which gives
# Group.1 x
# 3 setosa_vs_virginica -0.7945321
# 7 virginica_vs_setosa -0.7945321
# 2 setosa_vs_versicolor -0.4868352
# 4 versicolor_vs_setosa -0.4868352
# 6 versicolor_vs_virginica 0.3774612
# 8 virginica_vs_versicolor 0.3774612
# 5 versicolor_vs_versicolor 0.4134413
# 9 virginica_vs_virginica 0.7622797
# 1 setosa_vs_setosa 0.8698189
If you’re still not comfortable with performing lsa::cosine() on a scaled, numerical dataframe, we can certainly do explicit pairwise calculations.
The formula you gave for PSM, or cosine similarity of patients, is expressed in two formats at Wikipedia
Remembering that vectors A and B represent the ordered list of attributes for PatientA and PatientB, the PSM is the dot product of A and B, divided by (the scalar product of [the magnitude of A] and [the magnitude of B])
The terse way of saying that in R is
cosine.sim <- function(A, B) { A %*% B / sqrt(A %*% A * B %*% B) }
But we can rewrite that to look more similar to your post as
cosine.sim <- function(A, B) { A %*% B / (sqrt(A %*% A) * sqrt(B %*% B)) }
I guess you could even re-write that (the calculations of similarity between a single pair of individuals) as a bunch of nested loops, but in the case of a manageable amount of data, please don’t. R is highly optimized for operations on vectors and matrices. If you’re new to R, don’t second guess it. By the way, what happened to your millions of rows? This will certainly be less stressful now that your down to tens of thousands.
Anyway, let’s say that each individual only has two elements.
individual.1 <- c(1, 0)
individual.2 <- c(1, 1)
So you can think of individual.1 as a line that passes between the origin (0,0) and (0, 1) and individual.2 as a line that passes between the origin and (1, 1).
some.data <- rbind.data.frame(individual.1, individual.2)
names(some.data) <- c('element.i', 'element.j')
rownames(some.data) <- c('individual.1', 'individual.2')
plot(some.data, xlim = c(-0.5, 2), ylim = c(-0.5, 2))
text(
some.data,
rownames(some.data),
xlim = c(-0.5, 2),
ylim = c(-0.5, 2),
adj = c(0, 0)
)
segments(0, 0, x1 = some.data[1, 1], y1 = some.data[1, 2])
segments(0, 0, x1 = some.data[2, 1], y1 = some.data[2, 2])
So what’s the angle between vector individual.1 and vector individual.2? You guessed it, 0.785 radians, or 45 degrees.
cosine.sim <- function(A, B) { A %*% B / (sqrt(A %*% A) * sqrt(B %*% B)) }
cos.sim.result <- cosine.sim(individual.1, individual.2)
angle.radians <- acos(cos.sim.result)
angle.degrees <- angle.radians * 180 / pi
print(angle.degrees)
# [,1]
# [1,] 45
Now we can use the cosine.sim function I previously defined, in two nested loops, to explicitly calculate the pairwise similarities between each of the iris flowers. Remember, psm_scaled has already been defined as the scaled numerical values from the iris dataset.
cs.melt <- lapply(rownames(psm_scaled), function(name.A) {
inner.loop.result <-
lapply(rownames(psm_scaled), function(name.B) {
individual.A <- psm_scaled[rownames(psm_scaled) == name.A, ]
individual.B <- psm_scaled[rownames(psm_scaled) == name.B, ]
similarity <- cosine.sim(individual.A, individual.B)
return(list(name.A, name.B, similarity))
})
inner.loop.result <-
do.call(rbind.data.frame, inner.loop.result)
names(inner.loop.result) <-
c('flower.A', 'flower.B', 'similarity')
return(inner.loop.result)
})
cs.melt <- do.call(rbind.data.frame, cs.melt)
Now we repeat the calculation of cs.melt$class.A, cs.melt$class.B, and cs.melt$comparison as above, and calculate cs.agg.from.loops as the mean similarity between the various types of comparisons:
cs.agg.from.loops <-
aggregate(cs.agg.from.loops$similarity, by = list(cs.agg.from.loops $comparison), mean)
print(cs.agg.from.loops[order(cs.agg.from.loops$x),])
# Group.1 x
# 3 setosa_vs_virginica -0.7945321
# 7 virginica_vs_setosa -0.7945321
# 2 setosa_vs_versicolor -0.4868352
# 4 versicolor_vs_setosa -0.4868352
# 6 versicolor_vs_virginica 0.3774612
# 8 virginica_vs_versicolor 0.3774612
# 5 versicolor_vs_versicolor 0.4134413
# 9 virginica_vs_virginica 0.7622797
# 1 setosa_vs_setosa 0.8698189
Which, I believe is identical to the result we got with lsa::cosine.
So what I'm trying to say is... why wouldn't you use lsa::cosine?
Maybe you should be more concerned with
selection of variables, including removal of highly correlated variables
scaling/normalizing/standardizing the data
performance with a large input data set
identifying known similars and dissimilars for quality control
as previously addressed
This is the first time I am working with graphs and R igraph package and I need some help with processing graph objects.
What I want to achieve:
From a given contact matrix extract shortest confident path between nodes. By confident I mean that edge weights are higher then neighbouring edges.
Examples:
A
m <- read.table(row.names = 1, header = TRUE, text =
" A B C D E F
A 0 1 1 1 1 5
B 1 0 1 1e2 1e2 1
C 1 1 0 1 1 1
D 1 1e2 1 0 1e2 1
E 1 1e2 1 1e2 0 1
F 5 1 1 1 1 0")
m <- as.matrix(m)
ig <- graph.adjacency(m, mode = "undirected", weighted = TRUE, diag = FALSE)
sp <- shortest.paths(ig, algorithm = "dijkstra")
In matrix m there is one cluster (clique?) between B-D-E (ie., egde weights between those nodes are high). However, as there is weight between A and F I am also getting cluster there, even though edge weight is low (only 5).
Question A: How to extract only those clusters that have high edge weight? I can transform those contacts to 0 with m[which(m <= 5)] <- 0, but I hope that there is more "mathy" solution for this in igraph package.
B
m <- read.table(row.names = 1, header = TRUE, text =
" A B C D E F
A 0 1 1 5 1 1
B 1 0 1 1e2 1e2 1
C 1 1 0 1 1 1
D 5 1e2 1 0 1e2 1
E 1 1e2 1 1e2 0 1
F 1 1 1 1 1 0")
m <- as.matrix(m)
ig <- graph.adjacency(m, mode = "undirected", weighted = TRUE, diag = FALSE)
sp <- shortest.paths(ig, algorithm = "dijkstra")
In matrix m there is cluster between B-D-E, but as there is low weight between A and B - A is also connected to this cluster.
Question B: How to not assign nodes to a cluster if edge weight is low?
This is my first question here, if you need clarification or better examples I will improve my questions.
First, it is good to know that when looking up paths, igraph understands weights as costs, i.e. on edges with higher weight it costs more to travel, so it will consider shorter the paths with lower sum weight. It is easy to turn this into the opposite, just take the reciprocal of your weights (1 / E(ig)$weight). Between 2 vertices there might be only one shortest path, but sometimes there are more equally short paths. You can look up all of them (all_shortest_paths), or tell igraph to return only one of the shortests for each pairs of vertices (shortest_paths). Each call of these methods returns the paths from one selected vertex, to have the paths between all pairs, you need to call these once for each vertex (ok, at an undirected graph, it is enough to call for half of the vertices). To formulate what I explained until this point:
spaths <- lapply(V(ig),
function(v){
all_shortest_paths(ig, v,
weight = 1 / E(ig)$weight
)
}
)
Here spaths will be a list of lists, access the paths from C to all the vertices like this:
spaths$C$res
[[1]]
+ 2/6 vertices, named:
[1] C A
[[2]]
+ 2/6 vertices, named:
[1] C B
[[3]]
+ 1/6 vertex, named:
[1] C
[[4]]
+ 2/6 vertices, named:
[1] C D
[[5]]
+ 2/6 vertices, named:
[1] C E
[[6]]
+ 2/6 vertices, named:
[1] C F
spaths$C$res[[2]] # this is the path from `C` to `B`,
# a vector of 2 vertices
Note, the third element is actually from C to itself, you can either ignore it, or provide a vector of all other vertices to the to parameter of all_shortest_paths. Also, in your example all shortest paths will be of length 1, but if I set for example the weight of B--E to 1 instead of 100, we see that the method works, and from B to E the shortest path will be B-D-E.
Regarding your second question, here it is not completely clear what do you want to achieve, especially how do you get these clusters? If you want to find communities, i.e. more closely connected group of vertices, taking into account also the edge weights, there are many methods for this, all those named cluster_[...] or community.[...] in igraph. For example, if we run the fastgreedy method on your graph, it will detect the cluster you mentioned:
fg <- fastgreedy.community(ig, weights = E(ig)$weight)
IGRAPH clustering fast greedy, groups: 2, mod: 0.059
+ groups:
$`1`
[1] "A" "C" "F"
$`2`
[1] "B" "D" "E"
So here we have the B, D, E cluster, what is connected with higher weight edges. If we run the same method without weights, all the vertices will belong to one group (fastgreedy.community(ig, weights = NULL)). Note, at community detection, igraph understands weights as strength, so vertices connected with higher weight edges more likely to cluster together, this is kind of opposite like how it works at calculating paths.
I have a simple (indeed standard in economics) nonlinear constrained discrete maximisation problem to solve in R and am having trouble. I found solutions for parts of the problem (nonlinear maximisation; discrete maximisation) but not for the union of all the problems.
Here is the problem. A consumer wants to buy three products (ananas, banana, cookie), knows the prices and has a budget of 20€. He likes variety (i.e., he wants to have all three products if possible) and his satisfaction is decreasing in the amount consumed (he likes his first cookie way more than his 100th).
The function he wishes to maximise is
and of course since each has a price, and he has a limited budget, he maximises this function under the constraint that
What I want to do is to find the optimal buying list (N ananas, M bananas, K cookies) that satisfies the constraint.
If the problem were linear, I would simply use linprog::solveLP(). But the objective function is nonlinear.
If the problem were of a continuous nature, ther would be a simple analytic solution to it.
The question being discrete and nonlinear, I do not know how to proceed.
Here is some toy data to play with.
df <- data.frame(rbind(c("ananas",2.17),c("banana",0.75),c("cookie",1.34)))
names(df) <- c("product","price")
I'd like to have an optimization routine that gives me an optimal buying list of (N,M,K).
Any hints?
1) no packages This can be done by brute force. Using df from the question as input ensure that price is numeric (it's a factor in the df of the question) and calculate the largest number mx for each variable. Then create grid g of variable counts and compute the total price of each and the associated objective giving gg. Now sort gg in descending order of objective and take those solutions satisfying the constraint. head will show the top few solutions.
price <- as.numeric(as.character(df$price))
mx <- ceiling(20/price)
g <- expand.grid(ana = 0:mx[1], ban = 0:mx[2], cook = 0:mx[3])
gg <- transform(g, total = as.matrix(g) %*% price, objective = sqrt(ana * ban * cook))
best <- subset(gg[order(-gg$objective), ], total <= 20)
giving:
> head(best) # 1st row is best soln, 2nd row is next best, etc.
ana ban cook total objective
1643 3 9 5 19.96 11.61895
1929 3 7 6 19.80 11.22497
1346 3 10 4 19.37 10.95445
1611 4 6 5 19.88 10.95445
1632 3 8 5 19.21 10.95445
1961 2 10 6 19.88 10.95445
2) dplyr This can also be nicely expressed using the dplyr package. Using g and price from above:
library(dplyr)
g %>%
mutate(total = c(as.matrix(g) %*% price), objective = sqrt(ana * ban * cook)) %>%
filter(total <= 20) %>%
arrange(desc(objective)) %>%
top_n(6)
giving:
Selecting by objective
ana ban cook total objective
1 3 9 5 19.96 11.61895
2 3 7 6 19.80 11.22497
3 3 10 4 19.37 10.95445
4 4 6 5 19.88 10.95445
5 3 8 5 19.21 10.95445
6 2 10 6 19.88 10.95445
If you do not mind using a "by hand" solution:
uf=function(x)prod(x)^.5
bf=function(x,pr){
if(!is.null(dim(x)))apply(x,1,bf,pr) else x%*%pr
}
budget=20
df <- data.frame(product=c("ananas","banana","cookie"),
price=c(2.17,0.75,1.34),stringsAsFactors = F)
an=0:(budget/df$price[1]) #include 0 for all possibilities
bn=0:(budget/df$price[2])
co=0:(budget/df$price[3])
X=expand.grid(an,bn,co)
colnames(X)=df$product
EX=apply(X,1,bf,pr=df$price)
psX=X[which(EX<=budget),] #1st restrict
psX=psX[apply(psX,1,function(z)sum(z==0))==0,] #2nd restrict
Ux=apply(psX,1,uf)
cbind(psX,Ux)
(sol=psX[which.max(Ux),])
uf(sol) # utility
bf(sol,df$price) #budget
> (sol=psX[which.max(Ux),])
ananas banana cookie
1444 3 9 5
> uf(sol) # utility
[1] 11.61895
> bf(sol,df$price) #budget
1444
19.96
I think this problem is very similar in nature to this question (Solve indeterminate equation system in R). The answer by Richie Cotton was the basis to this possible solution:
df <- data.frame(product=c("ananas","banana","cookie"),
price=c(2.17,0.75,1.34),stringsAsFactors = F)
FUN <- function(w, price=df$price){
total <- sum(price * w)
errs <- c((total-20)^2, -(sqrt(w[1]) * sqrt(w[2]) * sqrt(w[3])))
sum(errs)
}
init_w <- rep(10,3)
res <- optim(init_w, FUN, lower=rep(0,3), method="L-BFGS-B")
res
res$par # 3.140093 9.085182 5.085095
sum(res$par*df$price) # 20.44192
Notice that the total cost (i.e. price) for the solution is $ 20.44. To solve this problem, we can weight the error terms to put more emphasis on the 1st term, which relates to the total cost:
### weighting of error terms
FUN2 <- function(w, price=df$price){
total <- sum(price * w)
errs <- c(100*(total-20)^2, -(sqrt(w[1]) * sqrt(w[2]) * sqrt(w[3]))) # 1st term weighted by 100
sum(errs)
}
init_w <- rep(10,3)
res <- optim(init_w, FUN2, lower=rep(0,3), method="L-BFGS-B")
res
res$par # 3.072868 8.890832 4.976212
sum(res$par*df$price) # 20.00437
As LyzandeR remarked there is no nonlinear integer programming solver available in R. Instead, you can use the R package rneos that sends data to one of the NEOS solvers and returns the results into your R process.
Select one of the solvers for "Mixed Integer Nonlinearly Constrained Optimization" on the NEOS Solvers page, e.g., Bonmin or Couenne. For your example above, send the following files in the AMPL modeling language to one of these solvers:
[Note that maximizing the product x1 * x2 * x3 is the same as maximising the product sqrt(x1) * sort(x2) * sqrt(x3).]
Model file:
param p{i in 1..3};
var x{i in 1..3} integer >= 1;
maximize profit: x[1] * x[2] * x[3];
subject to restr: sum{i in 1..3} p[i] * x[i] <= 20;
Data file:
param p:= 1 2.17 2 0.75 3 1.34 ;
Command file:
solve;
display x;
and you will receive the following solution:
x [*] :=
1 3
2 9
3 5
;
This approach will work for more extended examples were solutions "by hand" are not reasonable and rounded optim solutions are not correct.
To look at a more demanding example, let me propose the following problem:
Find an integer vector x = (x_i), i=1,...,10, that maximizes x1 * ... * x10, such that p1*x1 + ... + p10*x10 <= 10, where p = (p_i), i=1,...,10, is the following price vector
p <- c(0.85, 0.22, 0.65, 0.73, 0.91, 0.11, 0.31, 0.47, 0.93, 0.71)
Using constrOptim for this nonlinear optimization problem with a linear inequality constraint, I get solutions like 900 for different starting points, but never the optimal solutions that is 960 !
I have a dataset of species and their rough locations in a 100 x 200 meter area. The location part of the data frame is not in a format that I find to be usable. In this 100 x 200 meter rectangle, there are two hundred 10 x 10 meter squares named A through CV. Within each 10 x 10 square there are four 5 x 5 meter squares named 1, 2, 3, and 4, respectively (1 is south of 2 and west of 3. 4 is east of 2 and north of 3). I want to let R know that A is the square with corners at (0 ,0), (10,0), (0,0), and (0,10), that B is just north of A and has corners (0,10), (0,20), (10,10), and (10,20), and K is just east of A and has corners at (10,0), (10,10), (20,0), and (20,10), and so on for all the 10 x 10 meter squares. Additionally, I want to let R know where each 5 x 5 meter square is in the 100 x 200 meter plot.
So, my data frame looks something like this
10x10 5x5 Tree Diameter
A 1 tree1 4
B 1 tree2 4
C 4 tree3 6
D 3 tree4 2
E 3 tree5 3
F 2 tree6 7
G 1 tree7 12
H 2 tree8 1
I 2 tree9 2
J 3 tree10 8
K 4 tree11 3
L 1 tree12 7
M 2 tree13 5
Eventually, I want to be able to plot the 100 x 200 meter area and have each 10 x 10 meter square show up with the number of trees, or number of species, or total biomass
What is the best way to turn the data I have into spatial data that R can use for graphing and perhaps analysis?
Here's a start.
## set up a vector of all 10x10 position tags
tags10 <- c(LETTERS,
paste0("A",LETTERS),
paste0("B",LETTERS),
paste0("C",LETTERS[1:22]))
A function to convert (e.g.) {"J",3} to the center of the corresponding sub-square.
convpos <- function(pos10,pos5) {
## convert letters to major (x,y) positions
p1 <- as.numeric(factor(pos10,levels=tags10)) ## or use match()
p1.x <- ((p1-1) %% 10) *10+5 ## %% is modulo operator
p1.y <- ((p1-1) %/% 10)*10+5 ## %/% is integer division
## sort out sub-positions
p2.x <- ifelse(pos5 <=2,2.5,7.5) ## {1,2} vs {3,4} values
p2.y <- ifelse(pos5 %%2 ==1 ,2.5,7.5) ## odd {1,3} vs even {2,4} values
c(p1.x+p2.x,p1.y+p2.y)
}
usage:
convpos("J",2)
convpos(mydata$tenbytenpos,mydata$fivebyfivepos)
Important notes:
this is a proof of concept, I can pretty much guarantee I haven't got the correspondence of x and y coordinates quite right. But you should be able to trace through this line-by-line and see what it's doing ...
it should work correctly on vectors (see second usage example above): I switched from switch to ifelse for that reason
your column names (10x10) are likely to get mangled into something like X10.10 when reading data into R: see ?data.frame and ?check.names
Similar to what #Ben Bolker has done, here's a lookup function (though you may need to transpose something to make the labels match what you describe).
tenbyten <- c(LETTERS[1:26],
paste0("A",LETTERS[1:26]),
paste0("B",LETTERS[1:26]),
paste0("C",LETTERS[1:22]))
tenbyten <- matrix(rep(tenbyten, each = 2), ncol = 10)
tenbyten <- t(apply(tenbyten, 1, function(x){rep(x, each = 2)}))
# the 1234 squares
squares <- matrix(c(rep(c(1,2),10),rep(c(4,3),10)), nrow = 20, ncol = 20)
# stick together into a reference grid
my.grid <- matrix(paste(tenbyten, squares, sep = "-"), nrow = 20, ncol = 20)
# a lookup function for the site grid
coordLookup <- function(tbt, fbf, .my.grid = my.grid){
x <- col(.my.grid) * 5 - 2.5
y <- row(.my.grid) * 5 - 2.5
marker <- .my.grid == paste(tbt, fbf, sep = "-")
list(x = x[marker], y = y[marker])
}
coordLookup("BB",2)
$x
[1] 52.5
$y
[1] 37.5
If this isn't what you're looking for, then maybe you'd prefer a SpatialPolygonsDataFrame, which has proper polygon IDs, and you attach data to, etc. In that case just Google around for how to make one from scratch, and manipulate the row() and col() functions to get your polygon corners, similar to what's given in this lookup function, which only returns centroids.
Edit: getting SPDF started:
This is modified from the function example and can hopefully be a good start:
library(sp)
# really you have a 20x20 grid, counting the small ones.
# c(2.5,2.5) specifies the distance in any direction from the cell center
grd <- GridTopology(c(1,1), c(2.5,2.5), c(20,20)))
grd <- as.SpatialPolygons.GridTopology(grd)
# get centroids
coords <- coordinates(polys)
# make SPDF, with an extra column for your grid codes, taken from the above.
# you can add further columns to this data.frame(), using polys#data
polys <- SpatialPolygonsDataFrame(grd,
data=data.frame(x=coords[,1], y=coords[,2], my.ID = as.vector(my.grid),
row.names=getSpPPolygonsIDSlots(grd)))