Negative length vectors are not allowed in distance function - r

I have a large data frame (375,000 rows and 5 columns); all variables are numerical. I would like to cluster this data frame spatio-temporally using hierarchical clustering in R. However, when I try to calculate the distance matrix, I get the following error: "negative length vectors are not allowed in distance function". Is it because I am exceeding the maximum memory my computer has (16 GB of RAM), or because I am exceeding the maximum length of a vector in R, which is 2^31 - 1 (around 2 billion) elements? By the way, how do I calculate the length of the distance matrix I am trying to compute? Is it 375,000^2, which equals roughly 140 billion?
In any case, what can I do regarding this problem? Can I somehow still use hierarchical clustering in this case?
Clustering using kmeans works perfectly but my supervisor prefers hierarchical clustering.
Any hints/suggestions will be greatly appreciated.
P.S. Rows represent vehicle trip IDs, and the columns represent: longitude of starting point, latitude of starting point, longitude of end point, latitude of end point, and time of trip on a specific day (all variables are scaled).

Yes. 375,000^2 is about 1.4 × 10^11, well beyond R's maximum vector length of 2^31 - 1; even the lower triangle that dist() actually stores, n(n-1)/2 ≈ 7 × 10^10 entries, is far too long.
The memory needed for a matrix is roughly rows * cols * size of the data type (8 bytes per double).
Compute the amount of memory you need, then go back to your supervisor with that result.
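For reference, a quick back-of-the-envelope check in R, using the numbers from the question:
n <- 375000
n_pairs <- n * (n - 1) / 2   # entries in the lower triangle that dist() stores
n_pairs                      # ~7.0e10, far beyond R's 2^31 - 1 vector-length limit
n_pairs * 8 / 1024^3         # ~524 GiB of RAM at 8 bytes per double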

Related

What is the most efficient way to find the closest geographic location?

I have two data frames in which observations are geographic locations defined by a latitude/longitude combination. For each point in df1 I would like to get the closest point in df2 and the associated value. I know how to do that by computing all the possible distances (using e.g. the gdist function from the Imap package) and getting the index of the smallest one. But this is prohibitively slow, as df1 has 1,000 rows and df2 some 15 million.
Do you have an idea of how I could reach my goal without computing all the distances? Maybe there is a way to limit the necessary calculations (for instance using the difference in latitude/longitude values)?
Thanks for helping,
Val
Here's what df1 looks like:
Latitude Longitude
1 56.76342 8.320824
2 54.93165 9.115982
3 55.80685 9.102455
4 57.27000 9.760000
5 56.76342 8.320824
6 56.89333 9.684435
7 56.62804 8.571573
8 56.64850 8.501947
9 55.40596 8.884374
10 54.89786 11.880828
then df2:
Latitude Longitude Value
1 41.91000 -4.780000 40500
2 41.61063 14.750832 13500
3 41.91000 -4.780000 4500
4 38.70000 -2.350000 28500
5 52.55172 0.088622 1500
6 39.06000 -1.830000 51000
7 41.91000 -4.780000 49500
8 48.00623 -4.389639 12000
9 56.24889 -3.666940 27000
10 42.72000 -3.750000 49500
Split the second frame into chunks of equal size
Then search only the chunks within a reasonable distance of your point. You will basically be drawing a checkerboard on a map. Your point will fall within one of these squares, so you only need to search that square and a few neighboring ones to be safe.
The naive brute-force search costs rows(df1) * rows(df2) comparisons. In our case that is 1000 * 15M, making for 15G operations times the computation time per operation.
So how do we split the data into chunks?
sort by latitude
sort by longitude
take equally spaced chunks
A sort takes about N log(N) operations. N is 15M in our case and log2(15M) ≈ 24, so the two sorts take roughly 2 * 24 * 15M ≈ 720M operations. Splitting into chunks is then linear, ~15M operations, maybe a few times over.
Once this separation is done, each chunk holds roughly total_points / chunk_count points, assuming your points are distributed fairly evenly.
The number of chunks is set by the chunk size you pick at the start: chunk_count = total_area / (chunk_side ^ 2), so smaller chunks mean more of them.
Ideally you want to balance the number of chunks against the number of points per chunk so that both are ~ sqrt(total_points).
Each of the thousand searches will now take only chunk_count + points_in_chunk * 9 operations (if we want to be super safe and search the chunk our point lands in plus all eight surrounding ones). So instead of 1000 * 15M you now have roughly 1000 * (sqrt(15M) * 10) ~ 1000 * 39K ~ 39M operations, an improvement of roughly a factor of 400.
Note that this improvement grows as the second set gets larger, and shrinks if you choose the chunk size poorly.
For further improvement, you can iterate this once or twice more, making chunks in chunks. The logic is similar.
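A rough R sketch of the checkerboard idea, assuming df1 and df2 are data frames with the Latitude/Longitude columns shown above and picking an arbitrary, untuned cell size:
cell_size <- 0.5                                   # grid cell side, in degrees (arbitrary)
df2$cell_lat <- floor(df2$Latitude  / cell_size)
df2$cell_lon <- floor(df2$Longitude / cell_size)
# For one query point from df1, consider only its cell and the 8 neighbouring cells:
q     <- df1[1, ]
q_lat <- floor(q$Latitude  / cell_size)
q_lon <- floor(q$Longitude / cell_size)
candidates <- df2[abs(df2$cell_lat - q_lat) <= 1 &
                  abs(df2$cell_lon - q_lon) <= 1, ]
# ...then compute exact distances only against `candidates`.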
The distm function of the geosphere package will help you:
library(dplyr)
library(geosphere)
# Make sure to put longitude first and then latitude:
df <- df %>% select(Longitude, Latitude)
distm(as.matrix(df), as.matrix(df), fun = distGeo)
Remember, the distm function accepts matrix-class objects. You will obtain a 10x10 matrix of distances (in metres).
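One way to then use such a matrix for the original question, which.min per row picking the nearest df2 point for each df1 point (this only scales to small samples like the ones shown, not to 15 million rows):
d <- distm(as.matrix(df1 %>% select(Longitude, Latitude)),
           as.matrix(df2 %>% select(Longitude, Latitude)),
           fun = distGeo)
nearest_idx   <- apply(d, 1, which.min)   # index of the closest df2 row for each df1 row
nearest_value <- df2$Value[nearest_idx]   # and its associated Value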

R: Rank cells in a list of matrices based on cell position

I have a list of matrices containing association measurements between GPS tracked animals. One matrix in the list is observed association rates, the others are association rates for randomized versions of the GPS tracking trajectories. For example, I currently have 99 permutations of randomized tracking trajectories resulting in a list of 99 animal association matrices, plus the observed association matrix. I am expecting that for the animals that belong to the same pack, the observed association rates will be higher than the randomized association rates. Accordingly, I would like to determine the rank of the observed rates compared to the randomized rates for each dyad (cell). Essentially, I am doing a rank-permutation test. However, since I am only really concerned with determining if the observed association data is greater than the randomized trajectory association data, any result just giving the rank of the observed cells is sufficient.
ls <- list(matrix(10:18,3,3), matrix(18:10,3,3))
I've seen that using sapply can get the ranks of particular cells. Could I do the following for all cells and take the final number in the resulting vector to get the rank of the cell in that position in the list (knowing the position of the observed data in the list of matrices, e.g. last)?
rank(sapply(ls, '[',1,1))
The ideal result would be a matrix of the same form as those in the list giving the rank of the observed data, although any similar solutions are welcome. Thanks in advance.
You can proceed that way, but there are cleaner and quicker methods to get what you want.
Here's some code that would take your ls and produce a 3x3 matrix with the following properties:
if the entry in ls[[1]] is greater than the corresponding entry of ls[[2]], record a 1
if the entry in ls[[1]] is less than the corresponding entry of ls[[2]], record a 2
if the entries are equal, record a 1.5
result <- 1 * (ls[[1]] > ls[[2]]) + 2 * (ls[[1]] < ls[[2]]) + 1.5 * (ls[[1]] == ls[[2]])
How it works: when we do something like ls[[1]] > ls[[2]], we are ripping out the matrices of interest and directly comparing them. The result of this bit of code is a T/F-populated matrix, which is secretly coded as a 0/1 matrix. We can then multiply it by whatever coefficient we want to represent that situation.
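For the full problem (a list of 100 matrices rather than 2), a minimal sketch along the same lines, assuming the observed matrix sits last in the list as described in the question:
arr      <- simplify2array(ls)    # stack the list into a 3 x 3 x length(ls) array
obs_rank <- apply(arr, c(1, 2), function(v) rank(v)[length(v)])
obs_rank                          # per-cell rank of the observed (last) matrix; ties get average ranks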

Averaging different length vectors with same domain range in R

I have a dataset that looks like the one shown in the code.
What I am guaranteed is that the "(var)x" (domain) of each variable is always between 0 and 1. The "(var)y" (co-domain) can vary, but is also bounded, within a larger range.
I am trying to get an average of the "(var)y" values across the different variables, evaluated over a common "(var)x" grid.
I would like some kind of selective averaging, but I am not sure how to do this in R.
ax=c(0.11,0.22,0.33,0.44,0.55,0.68,0.89)
ay=c(0.2,0.4,0.5,0.42,0.5,0.43,0.6)
bx=c(0.14,0.23,0.46,0.51,0.78,0.91)
by=c(0.1,0.2,0.52,0.46,0.4,0.41)
qx=c(0.12,0.27,0.36,0.48,0.51,0.76,0.79,0.97)
qy=c(0.03,0.2,0.52,0.4,0.45,0.48,0.61,0.9)
a<-list(ax,ay)
b<-list(bx,by)
q<-list(qx,qy)
What I would like to have is something like
avgd_x = c(0.12,0.27,0.36,0.48,0.51,0.76,0.79,0.97)
and
avgd_y would contain, at each of those x values, the mean across the variables: for example, at 0.12 it would find the (interpolated) values of ay and by and take their mean together with qy's value there.
Similarly for all the values in the vector with the largest number of elements.
How can I do this in R?
P.S.: This is a toy dataset; my real dataset is spread over several files which I read with a custom function, but the raw data is as shown in the code above.
Edit:
Some clarification:
avgd_y would have the length of the largest vector. For example, in the case above, avgd_y would be (ay' + by' + qy)/3, where ay' and by' are c(ay(qx(i))) and c(by(qx(i))) for i from 1 to length(qx); that is, ay' and by' hold the values of ay and by interpolated at the data points of qx.
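A minimal sketch of the interpolation described in this clarification, assuming plain linear interpolation via approx() is acceptable (qx values outside the range of ax or bx come back as NA and are dropped from the mean):
ay_i <- approx(ax, ay, xout = qx)$y   # ay' : ay interpolated at the qx positions
by_i <- approx(bx, by, xout = qx)$y   # by' : by interpolated at the qx positions
avgd_x <- qx
avgd_y <- rowMeans(cbind(ay_i, by_i, qy), na.rm = TRUE)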

Representing closeness among elements of a double vector

I have a double vector:
r = -50 + (50+50)*rand(10,1)
Now I want all the numbers in the vector to be, ideally, equal up to a tolerance of say 1e-4. I want to represent each r with a scalar, say s(r), whose value gives an idea of the quality of the vector: the vector is high quality if all its elements are roughly equal. I can easily run a for loop like
for i = 1:10
    for j = i+1:10
        % check equality up to the tolerance
    end
end
But even then I cannot figure out what computation to do inside the nested for loops to assign a scalar representing the quality. Is there a better way, such that given any vector r of length n, I can quickly calculate a scalar representing the quality of the vector?
Your double-loop algorithm is somewhat slow, of order O(n^2) where n is the length of the vector. Here is a quick way to measure the closeness of the vector elements that runs in O(n), a single pass through the elements.
Find the maximum and the minimum of the vector elements: just use two variables to store the maximum and minimum so far and run once through all the elements. The difference between the maximum and the minimum is called the range of the values, a commonly accepted measure of dispersion. If the values are exactly equal, the range is zero, which indicates perfect quality. If the range is below 1e-4 then the vector is of acceptable quality. The bigger the range, the worse the quality.
The code is obvious for just about any given language, so I'll leave that to you. If the fact that the range only really considers the two extreme values of the vector bothers you, you could use other measures of variation such as the interquartile range, variance, or standard deviation. But the range seems to best fit what you request.
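The same one-pass idea, sketched in R rather than MATLAB:
r <- runif(10, min = -50, max = 50)   # analogue of the vector in the question
quality <- diff(range(r))             # max(r) - min(r), the range
acceptable <- quality < 1e-4          # TRUE if all elements agree to within the tolerance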

In R: sort the maximum dissimilarity between rows in a matrix

I have a matrix with 100 rows and 10 columns. I want to compare the diversity between rows and sort them, and then select the 10 most dissimilar rows. Which method can I use?
set.seed(123)
mat <- matrix(runif(100 * 10), nrow = 100, ncol = 10)
My initial idea is to calculate the similarity between two rows (e.g. the Tanimoto coefficient or others: http://en.wikipedia.org/wiki/Jaccard_index ), take dissimilarity = 1 - similarity, and then compare the dissimilarity values. Finally I would sort all the dissimilarity values and select the 10 largest. But the result would be a 100 * 100 matrix, so a more efficient method may be needed if there are a large number of rows. However, this is just my thought, maybe not right, so I need help.
[update]
After looking through some literature, I found one definition of the maximum dissimilarity method.
Maximum dissimilarity method: it begins by randomly choosing a data record as the first cluster center. The record maximally distant from the first point is selected as the next cluster center. The record maximally distant from both current points is selected after that. The process repeats until there is a sufficient number of cluster centers.
Here in my question, the sufficient number would be 10.
Thanks.
First of all, the Jaccard index is not right for you. From the Wikipedia page:
The Jaccard coefficient measures similarity between finite sample sets...
Your matrix has samples of floats, so you have a different problem (note that the Index in question is defined in terms of intersections; that should be a red flag right there :-).
So, you have to decide what you mean by dissimilarity. One natural interpretation would be to say row A is more dissimilar from the data set than row B if it has a greater Euclidean distance to the center of mass of the data set. You can think of the center of mass of the data set as the vector you get by taking the mean of each of the columns and putting them together (apply(mat, 2, mean)).
With this, you can take the distance of each row to that central vector, and then get an ordering on those distances. From that you can work back to the rows you desire from the original matrix.
All together:
center <- apply(mat, 2, mean)
# not quite the distances, actually, but their squares. That will work fine for us though, since the order
# will still be the same
dists <- apply(mat, 1, function(row) sum((row - center) ** 2))
# this gives us the row indices in order of least to greatest dissimilarity
dist.order <- order(dists)
# Now we just grab the 10 most dissimilar of those
most.dissimilar.ids <- dist.order[91:100]
# and use them to get the corresponding rows of the matrix
most.dissimilar <- mat[most.dissimilar.ids,]
If I was actually writing this, I probably would have compressed the last three lines as most.dissimilar <- mat[order(dists)[91:100],], but hopefully having it broken up like this makes it a little easier to see what's going on.
Of course, if distance from the center of mass doesn't make sense as the best way of thinking of "dissimilarity" in your context, then you'll have to amend with something that does.
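If instead the iterative maximum-dissimilarity selection quoted in the question's update is what is wanted, a minimal greedy sketch could look like this (assuming Euclidean distance between rows, which the question leaves open):
max_dissim_select <- function(mat, k) {
  d <- as.matrix(dist(mat))            # pairwise Euclidean distances between rows
  chosen <- sample(nrow(mat), 1)       # random first cluster centre
  while (length(chosen) < k) {
    # distance from every row to its nearest already-chosen row
    min_d <- apply(d[, chosen, drop = FALSE], 1, min)
    min_d[chosen] <- -Inf              # never re-pick an already-chosen row
    chosen <- c(chosen, which.max(min_d))
  }
  chosen
}
most.dissimilar <- mat[max_dissim_select(mat, 10), ]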
