I would like to extract, from a set of coordinates, the coordinate that is closest to a test coordinate.
The task is very similar to a previously posted one (Find the approximate value in the vector), but adapted to n-dimensional cases and with multiple inputs.
In other words, given:
test=t(data.frame(
  c(0.9,1.1,1),
  c(7.5,7.4,7.3),
  c(11,11,11.2)
))
reference=t(data.frame(
  c(1,0,0.5),
  c(2,2,2),
  c(3.3,3.3,3.3),
  c(9,9,9),
  c(10,11,12)
))
result <- approximate(test,reference)
#  1  0  0.5
#  9  9  9
# 10 11 12
I programmed a function using Euclidean distances and old-school loops, but when the input data frames are big it results in very long execution times.
Can anyone figure out a more efficient way of doing it?
Thank you in advance.
PS: This is the function I created; it works but takes a while (in case someone finds it useful).
approximate_function <- function(approximate, reference){
  # Returns, for each row of `approximate`, the closest row of `reference`.
  # It uses Euclidean distance.
  # Each point must be a row in the data frame;
  # the number of columns of the df is the dimension of the points.

  # Sub-function to calculate the Euclidean distance between two points
  distance_function <- function(a, b){
    squaresum <- 0
    for(id in 1:length(a)){
      squaresum <- squaresum + (a[id] - b[id])^2
    }
    result <- sqrt(squaresum)
    return(result)
  }

  result <- data.frame()
  # Choose 1 item to approximate at a time
  for(id_approximate in 1:nrow(approximate)){
    distance <- c()
    # Compare the point to approximate with the reference points and choose the one with the smallest distance
    for(id_reference in 1:nrow(reference)){
      distance[id_reference] <- distance_function(approximate[id_approximate, ], reference[id_reference, ])
    }
    result <- rbind(
      result,
      reference[which.min(distance), ]
    )
  }
  return(result)
}
This way the calculation is done instantly.
approximate_function <- function(approximate, reference){
  # Returns, for each row of `approximate`, the closest row of `reference`.
  # It uses Euclidean distance.
  # Each point must be a row in the data frame;
  # the number of columns of the df is the dimension of the points.
  results <- data.frame()
  # Choose 1 item to approximate at a time
  for(id in 1:nrow(approximate)){
    # Calculate the squared Euclidean distances to all reference points, whatever the dimension
    sumsquares <- rep(0, nrow(reference))
    for(dim in 1:ncol(approximate)){
      sumsquares <- sumsquares + (approximate[id, dim] - reference[, dim])^2
    }
    distances <- sqrt(sumsquares)
    results <- rbind(
      results,
      reference[which.min(distances), ]
    )
  }
  return(results)
}
You've got a few calculations that will be slow.
First:
test=t(data.frame(
  c(0.9,1.1,1),
  c(7.5,7.4,7.3),
  c(11,11,11.2)
))
This one probably doesn't matter, but it would be better as
test=rbind(
  c(0.9,1.1,1),
  c(7.5,7.4,7.3),
  c(11,11,11.2)
)
Same for setting up reference.
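For example, the reference from the question becomes:
reference=rbind(
  c(1,0,0.5),
  c(2,2,2),
  c(3.3,3.3,3.3),
  c(9,9,9),
  c(10,11,12)
)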
Second and third: You set up result as a data frame, then add rows to it one at a time. Data frames are much slower for row operations than matrices, and gradually growing a structure in R is slow. So set it up as a matrix of the right size from the beginning, and assign results into specific rows.
EDITED to add:
Fourth: there's no need for the inner loop. You can calculate all the squared differences in one big matrix, then use rowSums or colSums to get the squared distances. This is easiest if you work with matrix columns instead of rows, because a vector is recycled down the columns automatically.
Fifth: There's no need to take the square root; if the squared distance is minimized, so is the distance.
Here's the result:
approximate <- function(test, reference){
  # transpose the reference so each point is a column
  reference <- t(reference)
  # set up the result, not transposed
  result <- test * NA
  # Choose 1 item to approximate at a time
  for(id in seq_len(nrow(test))){
    squareddist <- colSums((test[id, ] - reference)^2)
    result[id, ] <- reference[, which.min(squareddist)]
  }
  return(result)
}
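For example, with test and reference rebuilt as plain matrices via rbind as suggested above, this reproduces the rows expected in the question:
result <- approximate(test, reference)
result
#  1  0  0.5
#  9  9  9
# 10 11 12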
Related
So I have a rather complex (at least for me) problem in R.
I want to calculate distances between two pair of distributions, for nearly 10k pairs.
I have a distance function from package philentropy, which takes two vectors x and y and calculates the distance between them, such as:
d <- distance(x, y, method="desired_method")
Another option is to create a matrix with each row representing a distribution, so that the function will calculate all pairwise distances among all distributions in the matrix:
d <- distance(x, method="desired_method")
I have two correlation matrices a and b with nearly 10k rows each, corresponding to 10k correlation distributions. Both matrices have the same number of rows, and my goal is to compare the first row of matrix a with the first row of matrix b, the second row of a with the second row of b, and so on, iteratively.
I can select the desired rows and use the first form of distance, or I can merge the two matrices with rbind and compute all pairwise distances with the second form.
The problem is that, with the first approach, I do not know how to write a for loop that iteratively gets the nth row of each matrix and performs the distance calculation while storing the result in a vector.
Additionally, if I perform the second option, I do not want to get all pairwise distances, but just distances corresponding to:
d[i,i+nrow(a)]
and do so iteratively to generate a corresponding vector of nrow(a) values.
Any help?
If you have two matrices, mat_x and mat_y, each with the same number of rows, then the for loop would be:
answer <- vector(mode = 'numeric', length = 10000L)
for (i in 1:10000){
  answer[[i]] <- distance(mat_x[i,], mat_y[i,], method="desired_method")
}
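Equivalently, you can let sapply handle the looping and preallocation for you. This is just a sketch that assumes the same two-vector distance() call shown above:
answer <- sapply(seq_len(nrow(mat_x)), function(i) {
  distance(mat_x[i, ], mat_y[i, ], method = "desired_method")
})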
I'm trying to calculate the difference between all points in a vector of length 10605 in R. For example, I am trying to do this:
for (i in 1:10605){
  for (j in 1:10605){
    differences[i] = housedata$Mean_household_income[i] - housedata$Mean_household_income[j]
  }
}
It is taking so long to compute, and I'm thinking there's a more timely way to calculate the difference between all the points with each other in this vector. Does anyone have any suggestions?
Thanks!
Seems like the dist function should do that. Distance matrices are only lower triangular because distance(x,y) == distance(y,x):
my.distances <- dist(housedata$Mean_household_income)
It's going to be faster since it's done in C code. Just type:
dist
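If you want the full square matrix rather than the lower-triangular dist object, you can convert it with as.matrix. A small sketch assuming the same housedata column (note that the full matrix for 10605 points needs roughly 900 MB):
d <- dist(housedata$Mean_household_income)  # lower-triangular absolute differences
D <- as.matrix(d)                           # full symmetric matrix, D[i, j] == D[j, i]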
You could loop through an incrementally shifted/wrapped copy of the vector and subtract the two vectors. You still have to loop through the length of the data once and shift and subtract the vector each time, but it will probably save some time.
Here is an example:
# make a shift/wrap function
shift <- function(df, offset){
  df[((1:length(df)) - 1 - offset) %% length(df) + 1]
}
# make some data
data <- seq(1, 4)
# make an empty vector to hold the data
difs <- vector()
# loop through the data
for(i in 1:length(data)){
  shifted <- shift(data, i)
  result <- data - shifted
  difs <- c(difs, result)
}
print(difs)
What about using outer? It uses a vectorized function (here -) on all combinations of two vectors and stores the results in a matrix.
For example,
x <- runif(10605)
system.time(
  differences <- outer(x, x, '-')
)
takes one second on my computer.
Suppose X is a vector of length 100 holding the X position of 100 individuals. All agents start at position 0:
X <- rep(0,100)
but they are embedded in a world with boundaries. I have a function that randomly changes the X position of all the agents at a given time.
Store <- X
X <- X + runif(100)
Eventually, an agent will reach a boundary and, at that point, it should stay within the limits. The simplest way to do this is to loop through the vector and check with if (in pseudocode):
for (i in 1:length(X)) {
  if (between the boundaries) {keep the new X[i]} else {assign X[i] the value in Store[i]}
}
This is fine for 100 individuals, but the for loop adds too much computational time if the number of individuals (and the length of the vector) increases, for example, to 1000000.
Is there a more straightforward way to do it? I was thinking that maybe I could skip the reassignment of values that exceed the threshold during:
X <- X + runif(100)
EDIT: Also, imagine that X is not a vector but a matrix.
I realize this question is relatively old, but I just had the same question so I didn't want to leave it unanswered.
Limiting a vector or matrix to values within a certain range can be done concisely by combining an apply statement with the min and max functions, as shown in the example below.
# Create sample vector
X <- c(1:100); print(X)
# Create sample matrix
M <- matrix(c(1:100),nrow=10); print(M)
# Set limits
minV <- 15; maxV <- 85;
# Limit vector
sapply(X, function(y) min(max(y,minV),maxV))
# Limit matrix
apply(M, c(1, 2), function(x) min(max(x,minV),maxV))
For further information on the apply functionality I would refer to the R documentation and this article on R-Bloggers:
https://www.r-bloggers.com/using-apply-sapply-lapply-in-r/
When I first came across apply statements I found it a difficult concept to wrap my head around, but would now consider it one of R's most powerful features.
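Since the question is specifically about speed, note that base R's pmin and pmax work element-wise on vectors and matrices and avoid the per-element function calls of apply/sapply. A small sketch reusing minV and maxV from above:
# clamp every element of X (vector or matrix) into [minV, maxV],
# equivalent to the apply/sapply calls above but vectorized
X_clamped <- pmin(pmax(X, minV), maxV)

# if you instead want the question's exact rule, i.e. revert an
# out-of-bounds move to its previous value rather than clamping it:
X_new <- X + runif(length(X))
X <- ifelse(X_new >= minV & X_new <= maxV, X_new, X)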
I've got 2 data frames, each with 150 rows and 10 columns plus column and row IDs. I want to correlate every row in one data frame with every row in the other (i.e. 150 x 150 correlations) and plot the distribution of the resulting 22500 values. (Then I want to calculate p-values etc. from the distribution, but that's the next step.)
Frankly, I don't know where to start with this. I can read my data in and see how to correlate vectors or matching slices of two matrices etc., but I can't get a handle on what I'm trying to do here.
set.seed(42)
DF1 <- as.data.frame(matrix(rnorm(1500),150))
DF2 <- as.data.frame(matrix(runif(1500),150))
#transform to matrices for better performance
m1 <- as.matrix(DF1)
m2 <- as.matrix(DF2)
#use outer to get all combinations of row numbers and apply a function to them
#22500 combinations is small enough to fit into RAM
cors <- outer(seq_len(nrow(DF1)), seq_len(nrow(DF2)),
              #you need a vectorized function
              #Vectorize takes care of that, but is just a hidden loop (slow for huge row numbers)
              FUN = Vectorize(function(i, j) cor(m1[i,], m2[j,])))
hist(cors)
You can use cor with two arguments:
cor( t(m1), t(m2) )
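cor applied to two matrices correlates their columns, so transposing makes it correlate the rows. A quick check against the outer-based version above (a sketch; it should return TRUE):
cors2 <- cor(t(m1), t(m2))                        # 150 x 150 matrix of row-vs-row correlations
all.equal(cors, cors2, check.attributes = FALSE)  # TRUE, same values as the outer() result
hist(cors2)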
I'm stuck with a simple loop that takes more than an hour to run, and need help to speed it up.
Basically, I have a matrix with 31 columns and 400 000 rows. The first 30 columns have values, and the 31st column holds a column number. I need to retrieve, for each row, the value in the column indicated by the 31st column.
Example row: [26,354,72,5987..,461,3] (this means that the value in column 3 is sought after (72))
The too slow loop looks like this:
a <- rep(0,nrow(data)) #To pre-allocate memory
for (i in 1:nrow(data)) {
  a[i] <- data[i, data[i,31]]
}
I would think this would work:
a <- data[,data[,31]]
... but it results in "Error: cannot allocate vector of size 2.8 Mb".
I fear that this is a really simple question, so I've spent hours trying to understand apply, lapply, reshape, and more, but somehow I can't get a grip on the vectorization concept in R.
The matrix actually has even more columns that also go into the a-parameter, which is why I don't want to rebuild the matrix, or split it.
Your support is highly appreciated!
Chris
t(data[,1:30])[30*(0:399999)+data[,31]]
This works because you can reference matrices both in array format and in vector format (a 30*400000-long vector in this case, since the transposed 30-column block is being indexed), counting down columns first. To count across rows instead, you index the transpose.
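A small toy illustration of the same trick, with a hypothetical 3 x 4 matrix (so the stride is 4 instead of 30):
m <- matrix(1:12, nrow = 3)    # 3 rows, 4 value columns
idx <- c(2, 4, 1)              # per-row column to pick (the role of column 31)
t(m)[4 * (0:2) + idx]          # picks m[1,2], m[2,4], m[3,1]  ->  4 11 3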
Single-index notation for the matrix may use less memory. This would involve doing something like:
i <- nrow(data)*(data[,31]-1) + 1:nrow(data)
a <- data[i]
Below is an example of single-index notation for matrices in R. In this example, the index of the per-row maximum is appended as the last column of a random matrix. This last column is then used to select the per-row maxima via single-index notation.
## create a random (10 x 5) matrix
M <- matrix(rpois(50,50),10,5)
## use the last column to index the maximum value of the first 5
## columns
MM <- cbind(M,apply(M,1,which.max))
## column ID row ID
i <- nrow(MM)*(MM[,ncol(MM)]-1) + 1:nrow(MM)
all(MM[i] == apply(M,1,max))
Using an index matrix is an alternative that will probably use more memory but is slightly clearer:
ii <- cbind(1:nrow(MM),MM[,ncol(MM)])
all(MM[ii] == apply(M,1,max))
Try to change the code to work a column at a time:
M <- matrix(rpois(30*400000,50),400000,30)
MM <- cbind(M,apply(M,1,which.max))
a <- rep(0,nrow(MM))
for (i in 1:(ncol(MM)-1)) {
  a[MM[, ncol(MM)] == i] <- MM[MM[, ncol(MM)] == i, i]
}
This sets each element of a to the value from column i wherever the last column equals i. It took longer to build the matrix than to calculate the vector a.
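As a quick sanity check on the example above (since the last column was built with which.max, the selected values should be the per-row maxima):
all(a == apply(M, 1, max))   # expected TRUE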