There are several posts on computing pairwise differences among vectors, but I cannot find how to compute all differences within a vector.
Say I have a vector, v.
v<-c(1:4)
I would like to generate a second vector that is the absolute value of all pairwise differences within the vector. Similar to:
abs(1-2) = 1
abs(1-3) = 2
abs(1-4) = 3
abs(2-3) = 1
abs(2-4) = 2
abs(3-4) = 1
The output would be a vector of 6 values, which are the result of my 6 comparisons:
output<- c(1,2,3,1,2,1)
Is there a function in R that can do this?
as.numeric(dist(v))
seems to work; it treats v as a column matrix and computes the Euclidean distance between rows, which in this case is sqrt((x-y)^2)=abs(x-y)
If we're golfing, then I'll offer c(dist(v)), which is equivalent and which I'm guessing will be unbeatable.
@AndreyShabalin makes a good point that using method="manhattan" will probably be slightly more efficient, since it avoids the squaring/square-rooting.
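For what it's worth, a minimal sketch of that variant (dist() with method = "manhattan"), which for a one-column matrix is still just abs(x - y):
v <- 1:4
c(dist(v, method = "manhattan"))
## [1] 1 2 3 1 2 1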
Let's play golf
abs(apply(combn(1:4,2), 2, diff))
@Ben, yours is a killer!
> system.time(apply(combn(1:1000,2), 2, diff))
user system elapsed
6.65 0.00 6.67
> system.time(c(dist(1:1000)))
user system elapsed
0.02 0.00 0.01
> system.time({
+ v <- 1:1000
+ z = outer(v,v,'-');
+ z[lower.tri(z)];
+ })
user system elapsed
0.03 0.00 0.03
Who knew that elegant (read: understandable/flexible) code could be so slow.
A possible solution is:
z = outer(v,v,'-');
z[lower.tri(z)];
[1] 1 2 3 1 2 1
Is it possible to get the multiple comparison adjustment in pairwise.prop.test() to use fewer than the full number of comparisons? For example, if I only care about 4 vs 1, 2, 3 (3 comparisons) below, I would multiply the p-values in the bottom row by 3 instead of 6 (the full number of pairwise comparisons) to do the Bonferroni adjustment. p.adjust has the n argument, but I can't figure out how to pass it through by doing something like
pairwise.prop.test(x=b$s,n=b$pop,p.adjust.method="bonferroni", p.adjust.n = 3, alternative="two.sided", correct = FALSE)
With Bonferroni, it's trivial, but much more involved with the other types of corrections.
Here's the result (with code below):
> b <- data.frame(
+ s=c(18,53,49,30),
+ pop=c(29,100,88,73),
+ reg=c("1","2","3","4")
+ )
> pairwise.prop.test(x=b$s,n=b$pop,p.adjust.method="none",alternative="two.sided", correct = FALSE)
Pairwise comparisons using Pairwise comparison of proportions
data: b$s out of b$pop
1 2 3
2 0.387 - -
3 0.547 0.713 -
4 0.056 0.122 0.065
P value adjustment method: none
> pairwise.prop.test(x=b$s,n=b$pop,p.adjust.method="bonferroni", alternative="two.sided", correct = FALSE)
Pairwise comparisons using Pairwise comparison of proportions
data: b$s out of b$pop
1 2 3
2 1.00 - -
3 1.00 1.00 -
4 0.33 0.73 0.39
P value adjustment method: bonferroni
Code:
b <- data.frame(
s=c(18,53,49,30),
pop=c(29,100,88,73),
reg=c("1","2","3","4")
)
pairwise.prop.test(x=b$s,n=b$pop,p.adjust.method="none",alternative="two.sided", correct = FALSE)
pairwise.prop.test(x=b$s,n=b$pop,p.adjust.method="bonferroni", alternative="two.sided", correct = FALSE)
Based on @27 ϕ 9's comment:
b <- data.frame(
s=c(18,53,49,30),
pop=c(29,100,88,73),
reg=c("1","2","3","4")
)
unadj <- pairwise.prop.test(x=b$s, n=b$pop, p.adjust.method="none", alternative="two.sided")
# row 3 of the p-value matrix holds the 4 vs 1,2,3 comparisons,
# so only those 3 values enter the adjustment
p.adjust(unadj$p.value[3, ], method = "holm")
pairwise.prop.test(x=b$s,n=b$pop,p.adjust.method="bonferroni", alternative="two.sided")
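Equivalently, for a Bonferroni-style correction over only those 3 comparisons, the n argument of p.adjust can be set explicitly (a small sketch reusing the unadj object above):
# adjust only the 4 vs 1,2,3 p-values, treating 3 as the family size
p.adjust(unadj$p.value[3, ], method = "bonferroni", n = 3)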
I have a simple task to do. I have a 3D array (10, 1350, 1280) and I want to calculate the min over the first dimension. I can do it using aaply like the following
minObs <- plyr::aaply(obs, c(2,3), min) # min of observation
But it is extremely slow compared to when I just write a nested loop.
minObs <- matrix(nrow = dim(obs)[2], ncol = dim(obs)[3])
for (i in 1:dim(obs)[2]){
  for (j in 1:dim(obs)[3]){
    minObs[i,j] <- min(obs[,i,j], na.rm = TRUE)
  }
}
I am new to R, but I am guessing that I am doing something wrong with the aaply function. Any hint would be very much appreciated. How can I speed this up using aaply?
Why not just use the base apply function?
apply(obs, c(2,3), min)
It's fast, doesn't require loading an additional package and gives the same result, as per:
all.equal(
  apply(obs, 2:3, min),
  plyr::aaply(obs, 2:3, min),
  check.attributes = FALSE)
#[1] TRUE
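One caveat, not part of the original answer: the loop above uses na.rm = TRUE, so if obs can contain NAs you would pass that through apply() as well to get matching results:
apply(obs, 2:3, min, na.rm = TRUE)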
Timings using system.time() on a 10 x 1350 x 1280 array:
Loop
# user system elapsed
# 3.79 0.00 3.79
Base apply()
# user system elapsed
# 2.87 0.02 2.89
plyr::aaply()
#Timing stopped at: 122.1 0.04 122.24
Given a column of data (of the type 39600.432, 39600.433, etc) I would like to drop the integer part of the number and keep only the decimals (transforming 39600.432 into 432, and 39600.433 into 433). How can I do this?
Let's say your column is the vector x.
> x <- c(39.456, 976.902)
> x <- x - as.integer(x)
> x
[1] 0.456 0.902
That should work. You can then just multiply by 1000 to convert the current x to integers. You will need some more processing if you want 3.9 to become 9.
> x <- 1000*x
> x
[1] 456 902
Hope that helps!
Many good answers, here's one more using regular expressions.
> g <- c(134.3412,14234.5453)
> gsub("^[^\\.]*\\.", "", g)
[1] "3412" "5453"
To strip the integral part without a subtraction or regex, you can use the modulus operator.
x <- (10000:10010)/100
x
## [1] 100.00 100.01 100.02 100.03 100.04 100.05 100.06 100.07 100.08 100.09 100.10
x %% 1
## [1] 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10
x %% 1 gives the fractional part directly. This does leave the value as a fraction rather than an integer, which may not be ideal for your use.
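If, as in the question, you want the decimals as whole numbers rather than fractions, one option (assuming a known, fixed number of decimal places, here 3) is to scale and round:
round(1000 * (39600.432 %% 1))
## [1] 432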
You are looking for the floor function. But you could do as.integer as well.
Here is an approach using regular expressions
g<-c(134.3412,14234.5453)
r<-regexpr("[0-9]+$",g)
as.numeric(regmatches(g,r))
This should do it:
g <- c(134.3412,14234.5453)
h <- floor(g)
g - h
I want to code the travelling salesman problem in R. I am going to begin with 3 cities at first, then expand to more cities. The distance matrix below gives the distances between the 3 cities. The objective (if someone doesn't know) is that a salesman will start from a city and visit the 2 other cities such that he travels the minimum total distance.
In the case below he should start either from ny or LA, then travel to chicago, and then to the remaining city. I need help to define A_ (my constraint matrix).
My decision variable matrix will have the same dimensions as the distance matrix. It will be a 0/1 matrix where 1 represents travel from the city given by the row name to the city given by the column name. For instance, if the salesman travels from ny to chicago, the 2nd element in row 1 will be 1. My row and column names are ny, chicago and LA.
By looking at the solution of the problem I concluded that my constraints will be:
Row sums have to be at most 1, as he cannot leave from the same city twice
Column sums have to be at most 1, as he cannot enter the same city twice
The total sum of the matrix elements has to be 2, as the salesman will be visiting 2 cities and leaving from 2 cities.
I need help to define A_ (my constraint matrix). How should I tie my decision variables into the constraints? (One possible construction of A_ is sketched after the code below.)
ny=c(999,9,20)
chicago=c(9,999,11)
LA=c(20,11,999)
distances=cbind(ny,chicago,LA)
dv=matrix(c("a11","a12","a13","a21","a22","a23","a31","a32","a33"),nrow=3,ncol=3)
c_=c(distances[1,],distances[2,],distances[3,])
signs = c((rep('<=', 7)))
b=c(1,1,1,1,1,1,2)
res = lpSolve::lp('min', c_, A_, signs, b, all.bin = TRUE)
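For reference, here is a minimal sketch of how A_ could be assembled to encode exactly the three constraints listed above, assuming the decision variables are ordered row by row to match c_ (a11, a12, a13, a21, ...). As the answer below explains, these constraints alone do not give a valid TSP formulation, so treat this only as an illustration of wiring decision variables into lp():
n <- nrow(distances)
rowIdx <- rep(1:n, each = n)   # city each variable leaves from
colIdx <- rep(1:n, times = n)  # city each variable arrives at
A_ <- rbind(
  t(sapply(1:n, function(i) as.numeric(rowIdx == i))),  # row sums <= 1
  t(sapply(1:n, function(j) as.numeric(colIdx == j))),  # column sums <= 1
  rep(1, n * n)                                          # total sum (last entry of b)
)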
There are some problems with your solution. The first is that the constraints you have in mind don't guarantee that all the cities will be visited -- for example, the path could just go from NY to LA and then back. This could be solved fairly easily, for example, by requiring that each row and column sum to exactly one rather than at most 1 (although in that case you'd be finding a traveling salesman tour rather than just a path).
The bigger problem is that, even if we fix this problem, your constraints wouldn't guarantee that the selected vertices actually form one cycle through the graph, rather than multiple smaller cycles. And I don't think that your representation of the problem can be made to address this issue.
Here is an implementation of Travelling Salesman using LP. The solution space is of size n^3, where n is the number of rows in the distance matrix. This represents n consecutive copies of the nxn matrix, each of which represents the edge traversed at time t for 1<=t<=n. The constraints guarantee that
At most one edge is traversed each step
Every vertex is visited exactly once
The start point of the i'th edge traversed is the same as the end point of the (i-1)'st
This avoids the problem of multiple small cycles. For example, with four vertices, the sequence (12)(21)(34)(43) would not be a valid solution because the endpoint of the second edge (21) does not match the start point of the third (34).
library(lpSolve)   # provides lp()

tspsolve <- function(x){
  diag(x) <- 1e10   # effectively forbid self-loops
  ## define some basic constants
  nx <- nrow(x)
  lx <- length(x)
  objective <- matrix(x, lx, nx)   # one copy of the distance matrix per time step
  rowNum <- rep(row(x), nx)
  colNum <- rep(col(x), nx)
  stepNum <- rep(1:nx, each = lx)
  ## these constraints ensure that at most one edge is traversed each step
  onePerStep.con <- do.call(cbind, lapply(1:nx, function(i) 1*(stepNum == i)))
  onePerRow.rhs <- rep(1, nx)
  ## these constraints ensure that each vertex is visited exactly once
  onceEach.con <- do.call(cbind, lapply(1:nx, function(i) 1*(rowNum == i)))
  onceEach.rhs <- rep(1, nx)
  ## these constraints ensure that the start point of the i'th edge
  ## is equal to the end point of the (i-1)'st edge
  edge.con <- c()
  for(s in 1:nx){
    s1 <- (s %% nx) + 1
    stepMask <- (stepNum == s)*1
    nextStepMask <- -(stepNum == s1)
    for(i in 1:nx){
      edge.con <- cbind(edge.con, stepMask*(colNum == i) + nextStepMask*(rowNum == i))
    }
  }
  edge.rhs <- rep(0, ncol(edge.con))
  ## now bind all the constraints together, along with right-hand sides and signs
  constraints <- cbind(onePerStep.con, onceEach.con, edge.con)
  rhs <- c(onePerRow.rhs, onceEach.rhs, edge.rhs)
  signs <- rep("==", length(rhs))
  ## call the lp solver (constraints are supplied with variables as rows,
  ## hence transpose = FALSE)
  res <- lp("min", objective, constraints, signs, rhs, transpose = FALSE, all.bin = TRUE)
  ## print the output of lp
  print(res)
  ## return the results as a sequence of vertices, and the score = total cycle length
  list(cycle = colNum[res$solution == 1], score = res$objval)
}
Here is an example:
set.seed(123)
x <- matrix(runif(16), 4, 4)
x
## [,1] [,2] [,3] [,4]
## [1,] 0.2875775 0.9404673 0.5514350 0.6775706
## [2,] 0.7883051 0.0455565 0.4566147 0.5726334
## [3,] 0.4089769 0.5281055 0.9568333 0.1029247
## [4,] 0.8830174 0.8924190 0.4533342 0.8998250
tspsolve(x)
## Success: the objective function is 2.335084
## $cycle
## [1] 1 3 4 2
##
## $score
## [1] 2.335084
We can check the correctness of this answer by using a primitive brute force search:
tspscore <- function(x, solution){
  sum(sapply(1:nrow(x), function(i) x[solution[i], solution[(i %% nrow(x)) + 1]]))
}
tspbrute <- function(x, trials){
  score <- Inf
  cycle <- c()
  nx <- nrow(x)
  for(i in 1:trials){
    temp <- sample(nx)
    tempscore <- tspscore(x, temp)
    if(tempscore < score){
      score <- tempscore
      cycle <- temp
    }
  }
  list(cycle = cycle, score = score)
}
tspbrute(x,100)
## $cycle
## [1] 3 4 2 1
##
## $score
## [1] 2.335084
Note that, even though these answers are nominally different, they represent the same cycle.
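One quick way to see this (a small helper, not part of either function above) is to rotate each cycle so that it starts at vertex 1:
canon <- function(cyc){ i <- which(cyc == 1); c(cyc[i:length(cyc)], cyc[seq_len(i - 1)]) }
canon(c(1, 3, 4, 2))
## [1] 1 3 4 2
canon(c(3, 4, 2, 1))
## [1] 1 3 4 2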
For larger graphs, though, the brute force approach doesn't work:
> set.seed(123)
> x<-matrix(runif(100),10,10)
> tspsolve(x)
Success: the objective function is 1.296656
$cycle
[1] 1 10 3 9 5 4 8 2 7 6
$score
[1] 1.296656
> tspbrute(x,1000)
$cycle
[1] 1 5 4 8 10 9 2 7 6 3
$score
[1] 2.104487
This implementation is pretty efficient for small matrices, but, as expected, it starts to deteriorate severely as they get larger. At about 15x15 it starts slowing down quite a bit:
timetsp <- function(x, seed = 123){
  set.seed(seed)
  m <- matrix(runif(x*x), x, x)
  gc()
  system.time(tspsolve(m))[3]
}
sapply(6:16,timetsp)
## elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed
## 0.011 0.010 0.018 0.153 0.058 0.252 0.984 0.404 1.984 20.003
## elapsed
## 5.565
You can use the gaoptim package to solve permutation/real valued problems - it's pure R, so it's not so fast:
Euro tour problem (see ?optim)
eurodistmat = as.matrix(eurodist)
# Fitness function (we'll perform a maximization, so invert it)
distance = function(sq)
{
  sq = c(sq, sq[1])
  sq2 <- embed(sq, 2)
  1/sum(eurodistmat[cbind(sq2[,2], sq2[,1])])
}
loc = -cmdscale(eurodist, add = TRUE)$points
x = loc[, 1]
y = loc[, 2]
n = nrow(eurodistmat)
set.seed(1)
# solving code
require(gaoptim)
ga2 = GAPerm(distance, n, popSize = 100, mutRate = 0.3)
ga2$evolve(200)
best = ga2$bestIndividual()
# end of solving code
# just transform and plot the results
best = c(best, best[1])
best.dist = 1/max(ga2$bestFit())
res = loc[best, ]
i = 1:n
plot(x, y, type = 'n', axes = FALSE, ylab = '', xlab = '')
title ('Euro tour: TSP with 21 cities')
mtext(paste('Best distance found:', best.dist))
arrows(res[i, 1], res[i, 2], res[i + 1, 1], res[i + 1, 2], col = 'red', angle = 10)
text(x, y, labels(eurodist), cex = 0.8, col = 'gray20')
How do I compute the weighted mean in R?
For example, I have 4 elements of which 1 element is of size (or: length, width, etc.) 10 and 3 elements are of size 2.
> z = data.frame(count=c(1,3), size=c(10,2))
> z
count size
1 1 10
2 3 2
The weighted average is (10 * 1 + 2 * 3) / 4 = 4.
Use weighted.mean:
> weighted.mean(z$size, z$count)
[1] 4
Seems like you already know how to calculate this; you just need a nudge in the right direction to implement it. Since R is vectorized, this is pretty simple:
with(z, sum(count*size)/sum(count))
The with bit just saves on typing and is equivalent to sum(z$count*z$size)/sum(z$count)
Or use the built-in function weighted.mean(), as you also pointed out. Rolling your own can prove faster, though it will not do the same amount of error checking that the built-in function does.
builtin <- function() with(z, weighted.mean(size, count))
rollyourown <- function() with(z, sum(count*size)/sum(count))
require(rbenchmark)
benchmark(builtin(), rollyourown(),
replications = 1000000,
columns = c("test", "elapsed", "relative"),
order = "relative")
#-----
test elapsed relative
2 rollyourown() 13.26 1.000000
1 builtin() 22.84 1.722474