I have two data frames in which observations are geographic locations defined by a latitude/longitude combination. For each point in df1 I would like to get the closest point in df2 and the associated value. I know how to do that by computing all the possible distances (using e.g. the gdist function from the Imap package) and getting the index of the smallest one. The problem is that this is excessively slow, as df1 has 1,000 rows and df2 some 15 million.
Do you have an idea of how I could reach my goal without computing all the distances? Maybe there is a way to limit the necessary calculations (for instance using the difference in latitude/longitude values)?
Here's what df1 looks like:
Latitude Longitude
1 56.76342 8.320824
2 54.93165 9.115982
3 55.80685 9.102455
4 57.27000 9.760000
5 56.76342 8.320824
6 56.89333 9.684435
7 56.62804 8.571573
8 56.64850 8.501947
9 55.40596 8.884374
10 54.89786 11.880828
then df2:
Latitude Longitude Value
1 41.91000 -4.780000 40500
2 41.61063 14.750832 13500
3 41.91000 -4.780000 4500
4 38.70000 -2.350000 28500
5 52.55172 0.088622 1500
6 39.06000 -1.830000 51000
7 41.91000 -4.780000 49500
8 48.00623 -4.389639 12000
9 56.24889 -3.666940 27000
10 42.72000 -3.750000 49500

Split the second frame into chunks of equal size.
Then search only the chunks within a reasonable distance of your point. You will basically be drawing a checkerboard on a map. Your point will be within one of these squares, so you will search only that one and a few neighboring ones to be safe.
Naive brute force search is rows(df1) * rows(df2). In our case 1000 * 15M, making for 15G operations times the computation time per operation.
So how do we split the data into chunks?
sort by latitude
sort by longitude
take equally spaced chunks
Each sort takes about N log(N) operations. N is 15M in our case, so these two sorts will take
~24 * 15M * 2 operations (log2(15M) is about 24). Splitting into chunks is then linear, ~15M operations, maybe a few times over.
When you have this separation done, each chunk holds about total_points * chunk_side^2 / total_area points, assuming that your points are distributed evenly.
The number of chunks is inversely proportional to the chunk area:
total_area/(chunk_side ^ 2).
Ideally you want to balance the number of chunks with the number of points in each chunk so that both are ~ sqrt(points_total).
Each of the thousand searches will now take only chunk_count + points_in_chunk * 9 operations (if we want to be super safe and search the chunk our point lands in and all the surrounding ones). So instead of 1000 * 15M you now have about 1000 * (sqrt(15M) * 18) ~ 1000 * 70K, an improvement by a factor of roughly 200.
Note that this improvement will grow if the second set gets larger. The improvement will be smaller if you choose the chunk size poorly.
For further improvement, you can iterate this once or twice more, making chunks in chunks. The logic is similar.
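The checkerboard idea can be sketched in base R roughly as follows. This is a minimal sketch, not a tuned implementation: the toy data, `cell_size`, and the helper names `cell_key`, `approx_d2` and `nearest_in_grid` are all made up for illustration, and the distance is a flat approximation in degrees (longitude shrunk by cos(latitude)) rather than a geodesic distance. Searching only the 3 x 3 block is safe only when the best candidate lies well within one chunk side of the query; otherwise the search ring should be widened.

```r
# Toy stand-ins for df1 (query points) and df2 (candidates); all values are made up.
set.seed(1)
df1 <- data.frame(Latitude  = runif(20, 54.5, 57.5),
                  Longitude = runif(20, 8.5, 11.5))
df2 <- data.frame(Latitude  = runif(5000, 54, 58),
                  Longitude = runif(5000, 8, 12),
                  Value     = sample(seq(1500, 51000, by = 1500), 5000, replace = TRUE))

cell_size <- 0.25  # chunk side in degrees (assumed; tune to your data density)

# Bucket the row indices of df2 by grid cell, done once up front.
cell_key <- function(i, j) paste(i, j)
buckets <- split(seq_len(nrow(df2)),
                 cell_key(floor(df2$Latitude / cell_size),
                          floor(df2$Longitude / cell_size)))

# Approximate squared distance in degrees, shrinking longitude by cos(latitude).
approx_d2 <- function(lat, lon, lats, lons) {
  (lats - lat)^2 + (cos(lat * pi / 180) * (lons - lon))^2
}

# Search only the chunk the point lands in plus its 8 neighbours.
nearest_in_grid <- function(lat, lon) {
  nb <- expand.grid(i = floor(lat / cell_size) + (-1:1),
                    j = floor(lon / cell_size) + (-1:1))
  cand <- unlist(buckets[cell_key(nb$i, nb$j)], use.names = FALSE)
  if (length(cand) == 0) return(NA_integer_)  # nothing nearby: widen the search
  cand[which.min(approx_d2(lat, lon, df2$Latitude[cand], df2$Longitude[cand]))]
}

idx <- mapply(nearest_in_grid, df1$Latitude, df1$Longitude)
df1$Value <- df2$Value[idx]  # value of the closest df2 point
```

Once the candidate index is known, an exact geodesic distance can still be computed for that single pair if needed.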

The distm function of the geosphere package will help you:
library(dplyr) # for %>% and select()
library(geosphere) # for distm() and distGeo()
# Make sure to put longitude first and then latitude:
df <- df %>% select(Longitude, Latitude)
distm(as.matrix(df), as.matrix(df), fun=distGeo)
Remember, the distm function accepts matrix class objects. You will obtain a 10x10 matrix of distances in meters.
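To go from a distance matrix to "closest point in df2 and its value", one option is to take the row-wise minimum. The sketch below is self-contained, so it uses a hand-written haversine function (`haversine_m`, a hypothetical helper on a spherical Earth of radius 6371 km) as a stand-in for distm; the rows are copied from the sample data in the question. For the full 15 million rows the matrix would have to be built in chunks.

```r
# Great-circle distance in meters on a sphere (stand-in for geosphere::distm).
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# A few rows taken from the question's df1 and df2.
df1 <- data.frame(Latitude  = c(56.76342, 54.93165, 55.80685),
                  Longitude = c(8.320824, 9.115982, 9.102455))
df2 <- data.frame(Latitude  = c(41.91000, 41.61063, 52.55172, 56.24889),
                  Longitude = c(-4.780000, 14.750832, 0.088622, -3.666940),
                  Value     = c(40500, 13500, 1500, 27000))

# Rows index df1, columns index df2.
d <- outer(seq_len(nrow(df1)), seq_len(nrow(df2)), function(i, j)
  haversine_m(df1$Longitude[i], df1$Latitude[i], df2$Longitude[j], df2$Latitude[j]))

nearest <- apply(d, 1, which.min)   # column index of the closest df2 row
df1$Value <- df2$Value[nearest]     # associated value
df1$dist_m <- d[cbind(seq_len(nrow(df1)), nearest)]
```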


Finding closest point between two vectors based on two dimensions

I have two matrices with large amounts of gps data:
User-based GPS data for each user i ((Latitude_i, Longitude_i), ...) ~ 12 million GPS coordinates
Store-based GPS data for each store j ((Latitude_j, Longitude_j), ...) ~ 15,000 GPS coordinates
What I need ultimately is the closest store j (from 2.) for each user i (from 1.).
The brute force (but computationally not feasible) solution would be to calculate the geographical distance between each user i (from 1.) and each store j (from 2.) and then take the lowest distance.
Since this would result in a 12 million x 15,000 matrix and I do not have access to a Big Data infrastructure, this is not really working for me.
So I am looking for smart solutions right now.
What crossed my mind so far, was the idea of finding the simple numerically closest point between each user i (1.) and each store j (2.)
using apply and which.min(abs(lat_i-lat_j) + abs(long_i + long_j))
and then calculate the geographical distance between these two points.
However, the challenge here is that I need a function that minimizes the overall difference across both coordinates, and the above solution doesn't seem to work.
Any help is very much appreciated!!

Negative length vectors are not allowed in distance function

I have a large data frame (375,000 rows and 5 columns); all variables are numerical. I would like to do spatio-temporal clustering of this data frame using hierarchical clustering in R. However, when I try to calculate the distance matrix, I get the following error: "Negative length vectors are not allowed in distance function". Is it because I am exceeding the maximum memory of my computer (16 GB RAM)? Or is it due to exceeding the maximum length of any vector in R, which is 2^31 - 1 (around 2 billion) elements? By the way, how do I calculate the length of the distance matrix that I am trying to compute? Is it 375,000^2, which equals roughly 140 billion?
In any case, what can I do regarding this problem? Can I somehow still use hierarchical clustering in this case?
Clustering using kmeans works perfectly but my supervisor prefers hierarchical clustering.
Any hints/suggestions will be greatly appreciated
P.S. Rows represent vehicle trips IDs, and columns represent: longitude of starting point, latitude of starting point, longitude of end point, latitude of end point and time of trip on specific day (all values are scaled for all variables).
Yes, 375000^2 (about 1.4 * 10^11) exceeds the 2^31 - 1 limit you mention.
The size of a matrix is roughly rows * cols * size of datatype (8 bytes for a double).
Compute the amount of memory you need, then go back to your supervisor with that result.
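As a rough back-of-the-envelope check, the calculation could look like this. It assumes the distances are stored as 8-byte doubles and that only the lower triangle is kept, which is how dist() stores its result; the variable names are just for illustration.

```r
n <- 375000

full_matrix <- n^2               # elements in a full n x n matrix
n_pairs     <- n * (n - 1) / 2   # elements in the lower triangle only

# Both counts are far beyond the 2^31 - 1 limit mentioned in the question.
over_limit <- n_pairs > .Machine$integer.max

# Memory needed for the lower triangle, in GiB.
gib <- n_pairs * 8 / 2^30

cat(sprintf("pairs: %.3g, over limit: %s, memory: %.0f GiB\n",
            n_pairs, over_limit, gib))
```

That is roughly 7 * 10^10 distances and on the order of 500 GiB, far more than 16 GB of RAM, so the full distance matrix is out of reach regardless of the vector length limit.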

Very slow raster::sampleRandom, what can I do as a workaround?

tl;dr: why is raster::sampleRandom taking so much time? e.g. to extract 3k cells from 30k cells (over 10k timesteps). Is there anything I can do to improve the situation?
EDIT: workaround at bottom.
Consider an R script in which I have to read a big file (usually more than 2-3GB) and perform quantile calculation over the data. I use the raster package to read the (netCDF) file. I'm using R 3.1.2 under 64bit GNU/Linux with 4GB of RAM, 3.5GB available most of the time.
As the files are often too big to fit into memory (even 2GB files for some reason will NOT fit into 3GB of available memory: unable to allocate vector of size 2GB) I cannot always do this, which is what I would do if I had 16GB of RAM:
pr <- brick(filename[i], varname=var[i], na.rm=T)
qs <- quantile(getValues(pr)*gain[i], probs=qprobs, na.rm=T, type=8, names=F)
But instead I can sample a smaller number of cells in my files using the function sampleRandom() from the raster package, still getting good statistics.
pr <- brick(filename[i], varname=var[i], na.rm=T)
qs <- quantile(sampleRandom(pr, cnsample)*gain[i], probs=qprobs, na.rm=T, type=8, names=F)
I perform this over 6 different files (i goes from 1 to 6) which all have about 30k cells and 10k timesteps (so 300M values). Files are:
1.4GB, 1 variable, filesystem 1
2.7GB, 2 variables, so about 1.35GB for the variable that I read, filesystem 2
2.7GB, 2 variables, so about 1.35GB for the variable that I read, filesystem 2
2.7GB, 2 variables, so about 1.35GB for the variable that I read, filesystem 2
1.2GB, 1 variable, filesystem 3
1.2GB, 1 variable, filesystem 3
Note that:
files are on three different NFS filesystems, whose performance I'm not sure of. I cannot rule out that the NFS filesystems vary greatly in performance from one moment to the next.
RAM usage is 100% all of the time when the script runs, but the system does not use all of its swap.
sampleRandom(dataset, N) takes N non-NA random cells from one layer (= one timestep), and reads their content. Does so for the same N cells for each layer. If you visualize the dataset as a 3D matrix, with Z as timesteps, the function takes N random non-NA columns. However, I guess the function does not know that all the layers have the NAs in the same positions, so it has to check that any column it chooses does not have NAs in it.
When using the same commands on files with 8393 cells (about 340MB in total) and reading all the cells, the computing time is a fraction of the time needed to read 1000 cells from a file with 30k cells.
The full script which produces the output below is here, with comments etc.
If I try to read all the 30k cells:
cannot allocate vector of size 2.6 Gb
If I read 1000 cells:
5 minutes
45 m
30 m
30 m
20 m
20 m
If I read 3000 cells:
15 minutes
18 m
35 m
34 m
60 m
60 m
If I try to read 5000 cells:
2.5 h
22 h
For the remaining files I had to stop after 18 h, because I needed the workstation for other tasks.
With more tests, I've been able to find out that it's the sampleRandom() function that's taking most of the computing time, not the calculation of the quantile (which I can speed up using other quantile functions, such as kuantile()).
Why is sampleRandom() taking so long? Why does it perform so strangely, sometimes fast and sometimes very slow?
What is the best workaround? I guess I could manually generate N random cells for the 1st layer and then manually raster::extract for all timesteps.
Working workaround is to do:
cells <- sampleRandom(pr[[1]], cnsample, cells=T) #Extract cnsample random cells from the first layer, excluding NAs
prvals <- pr[cells[,1]] #Read those cells from all layers
qs <- quantile(prvals, probs=qprobs, na.rm=T, type=8, names=F) #Compute quantile
This works and is very fast because all layers have NAs in the same positions. I think this should be an option that sampleRandom() could implement.

Looping in R to extract data

I have an object in "R" called p_int. This is a list of 1599 peak intensity numbers.
Within every 8 values of this list is a monoisotopic peak. This peak is the most abundant (largest peak value) compared to the other 7 peaks.
Therefore what I'd like to do is write a loop which processes p_int in batches of 8.
So it will take the first 8 values, find the largest value and add this to a new object called "m_iso".
It will then continue, looking at values 9-16, 17-24, 25-32 etc.
Any advice or code in helping me achieve such a loop would be greatly appreciated.
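By 1599 do you actually mean 1600? Because 1599 is not evenly divisible by 8. I'm going to assume this is true and offer either of the following:
# Option 1: split into 200 groups of 8 and take each group's maximum
m_iso <- sapply(split(p_int,rep(1:200,each=8)),max)
# Option 2: reshape into an 8-row matrix and take the column maxima
m_iso <- apply(matrix(p_int,nrow=8),2,max)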
This will give you a vector of maximum values for each set of eight observations.
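If the length really is 1599, a variant that does not require a multiple of 8 might look like the sketch below. The data are random stand-ins for p_int, and `batch` is just an illustrative name; if p_int is a true list rather than a numeric vector, it would need unlist() first.

```r
set.seed(42)
p_int <- runif(1599)   # made-up intensities standing in for the real p_int

# Batch number for every position: 1,1,...,1 (8 times), 2,2,..., and so on.
batch <- ceiling(seq_along(p_int) / 8)

# Maximum per batch; the last batch simply holds the 7 leftover values.
m_iso <- as.vector(tapply(p_int, batch, max))

length(m_iso)
```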

Cumulative sum of a georeferenced variable in R

I have a number of fishing boat tracks, and I'm trying to detect a certain pattern in their movement using R. In doing so I have reached a point where I have discarded all points of the track where the desired pattern is not occurring within a given time window, and I'm left with the remaining georeferenced points. These points have a score value associated, which measures the 'intensity' of the desired pattern.
1 32.34855 -35.49264 80.67
2 31.54764 -35.58691 18.14
3 31.38293 -35.25243 46.70
4 31.21447 -35.25830 22.65
5 30.76365 -35.38881 11.93
6 30.75872 -35.54733 22.97
7 30.60261 -35.95472 35.98
8 30.62818 -36.27024 31.09
9 31.35912 -35.73573 14.97
10 31.15218 -36.38027 37.60
The code below provides the same data
Because some of these points occur geographically close to each other I need to 'pool' their scores together. Hence, I now need a way to throw this data into some kind of a spatial grid and cumulatively sum the scores of all points that fall in the same cell of the grid. This would allow me to find in what areas a given fishing boat exhibits the pattern I'm after the most (and this is not just about time spent in one place). Ultimately, the preferred output would contain lat and lon for every grid cell (center), and the sum of all scores on each cell. In addition, I would also like to be able to adjust the sizing of the grid cells.
I've looked around and all I can find either does not preserve the georeferenced information, is very inefficient, or performs binning of data. There may already be some answers out there, but it might be the case that I'm not able to recognize them since I'm a bit out of my league on this stuff. Can someone please point me to some direction (package, function, etc.)? Any guidance will be greatly appreciated.
Take your lat/lon coordinates, and multiply them by the inverse of your desired grid cell edge lengths, measured in degrees. The result will be a pair of floating point numbers whose integer part identifies the grid cell in question. Take the floor of these and you have two numbers describing the cell, which you could paste to form a single string. You may add that as a new factor column of your data frame. Then you can perform operations based on that factor, like summarizing values.
library(plyr) # provides ddply(), .() and summarize()
latScale <- 2 # one cell for every 0.5 degrees
lonScale <- 2 # likewise
track_1$cell <- factor(with(track_1,
  paste(floor(LAT*latScale), floor(LON*lonScale), sep='.')))
ddply(track_1, .(cell), summarize,
  LAT=mean(LAT), LON=mean(LON), SCORE=sum(SCORE))
If you want to, you can use weighted.mean instead of mean. If you don't like these factors, you can put more effort in making them nice (e.g. by using compass directions instead of signs), or drop them altogether and use a pair of integer columns instead.
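Since the question asks for the cell centers rather than the mean position of the points, here is a base R sketch of the same idea that reports centers. It assumes the sample columns are LAT, LON and SCORE in that order (the question's table has no header), uses the first five sample rows, and `cell_deg`, `cell_lat`, `cell_lon` and `grid_sum` are illustrative names.

```r
# First five rows of the sample track, assuming columns LAT, LON, SCORE.
track_1 <- data.frame(
  LAT   = c(32.34855, 31.54764, 31.38293, 31.21447, 30.76365),
  LON   = c(-35.49264, -35.58691, -35.25243, -35.25830, -35.38881),
  SCORE = c(80.67, 18.14, 46.70, 22.65, 11.93))

cell_deg <- 0.5   # grid cell edge length in degrees (adjustable)

# floor() picks the cell; adding 0.5 moves from the cell's corner to its center.
track_1$cell_lat <- (floor(track_1$LAT / cell_deg) + 0.5) * cell_deg
track_1$cell_lon <- (floor(track_1$LON / cell_deg) + 0.5) * cell_deg

# Sum the scores of all points that fall in the same cell.
grid_sum <- aggregate(SCORE ~ cell_lat + cell_lon, data = track_1, FUN = sum)
grid_sum
```

Changing cell_deg resizes the grid; only cells that contain at least one point appear in the result.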
