Cumulative sum of a georeferenced variable in R

I have a number of fishing boat tracks, and I'm trying to detect a certain pattern in their movement using R. In doing so I have reached a point where I have discarded all points of the track where the desired pattern is not occurring within a given time window, and I'm left with the remaining georeferenced points. These points have a score value associated, which measures the 'intensity' of the desired pattern.
track_1[1:10,]:
LAT LON SCORE
1 32.34855 -35.49264 80.67
2 31.54764 -35.58691 18.14
3 31.38293 -35.25243 46.70
4 31.21447 -35.25830 22.65
5 30.76365 -35.38881 11.93
6 30.75872 -35.54733 22.97
7 30.60261 -35.95472 35.98
8 30.62818 -36.27024 31.09
9 31.35912 -35.73573 14.97
10 31.15218 -36.38027 37.60
The code below provides the same data:
track_1 <- data.frame(
  LAT   = c(32.34855, 31.54764, 31.38293, 31.21447, 30.76365, 30.75872, 30.60261, 30.62818, 31.35912, 31.15218),
  LON   = c(-35.49264, -35.58691, -35.25243, -35.25830, -35.38881, -35.54733, -35.95472, -36.27024, -35.73573, -36.38027),
  SCORE = c(80.67, 18.14, 46.70, 22.65, 11.93, 22.97, 35.98, 31.09, 14.97, 37.60))
Because some of these points occur geographically close to each other, I need to 'pool' their scores together. Hence, I now need a way to throw these data into some kind of spatial grid and cumulatively sum the scores of all points that fall in the same cell of the grid. This would allow me to find the areas where a given fishing boat exhibits the pattern I'm after the most (and this is not just about time spent in one place). Ultimately, the preferred output would contain the lat and lon of every grid cell (center), and the sum of all scores in each cell. In addition, I would also like to be able to adjust the size of the grid cells.
I've looked around and all I can find either does not preserve the georeferenced information, is very inefficient, or performs binning of data. There may already be some answers out there, but it might be the case that I'm not able to recognize them since I'm a bit out of my league on this stuff. Can someone please point me to some direction (package, function, etc.)? Any guidance will be greatly appreciated.

Take your lat/lon coordinates, and multiply them by the inverse of your desired grid cell edge lengths, measured in degrees. The result will be a pair of floating point numbers whose integer part identifies the grid cell in question. Take the floor of these and you have two numbers describing the cell, which you could paste to form a single string. You may add that as a new factor column of your data frame. Then you can perform operations based on that factor, like summarizing values.
Example:
latScale <- 2   # one cell for every 0.5 degrees
lonScale <- 2   # likewise
track_1$cell <- factor(with(track_1,
  paste(floor(LAT * latScale), floor(LON * lonScale), sep = '.')))
library(plyr)
ddply(track_1, .(cell), summarize,
      LAT = mean(LAT), LON = mean(LON), SCORE = sum(SCORE))
If you want to, you can use weighted.mean instead of mean. If you don't like these factors, you can put more effort in making them nice (e.g. by using compass directions instead of signs), or drop them altogether and use a pair of integer columns instead.
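If you do want the grid-cell centers that the question asks for, here is a minimal base-R sketch of the integer-column variant; the 0.5-degree cell size and the (index + 0.5) * cellSize center formula are my own illustrative choices, not part of the answer above:
cellSize <- 0.5                                    # grid cell edge length in degrees (adjust as needed)
track_1$cellLat <- floor(track_1$LAT / cellSize)   # integer cell index along latitude
track_1$cellLon <- floor(track_1$LON / cellSize)   # integer cell index along longitude

# sum the scores of all points falling in the same cell
pooled <- aggregate(SCORE ~ cellLat + cellLon, data = track_1, FUN = sum)

# convert the integer indices back to the coordinates of each cell center
pooled$LAT <- (pooled$cellLat + 0.5) * cellSize
pooled$LON <- (pooled$cellLon + 0.5) * cellSize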

Related

Moving spatial data off grid cell corners

I have a seemingly simple question that I can’t seem to figure out. I have a large dataset of millions of data points. Each data point represents a single fish with its biological information as well as when and where it was caught. I am running some statistics on these data and have been having issues, which I have finally tracked down to some data points having latitude and longitude values that fall exactly on the corners of the grid cells I am using to bin my data. When these fish, whose lats and longs fall exactly onto grid cell corners, are grouped into their appropriate grid cell, they end up being duplicated 4 times (once for each cell that touches the grid cell corner their lats and longs identify).
Needless to say this is bad, and I need to force those animals to have lats and longs that don’t put them exactly on a grid cell corner. I realize there are probably lots of ways to correct something like this, but what I really need is a simple way to identify latitudes and longitudes that have integer values, and then to modify them by a very small amount (randomly adding or subtracting) so as to shift them into a specific cell without creating a bias by shifting them all the same way.
I hope this explanation makes sense. I have included a very simple example in order to provide a workable problem.
fish <- data.frame(fish=1:10, lat=c(25,25,25,25.01,25.2,25.1,25.5,25.7,25,25),
long=c(140,140,140,140.23,140.01,140.44,140.2,140.05,140,140))
In this fish data frame there are 10 fish, each with an associated latitude and longitude. Fish 1, 2, 3, 9, and 10 have integer lat and long values that will place them exactly on the corners of my grid cells. I need some way of shifting just these values by something like plus or minus 0.01.
I can identify which lats or longs are integers easily enough with something like:
fish %>%
near(as.integer(fish$lat))
But am struggling to find a way to then modify all the integer values by some small amount.
To answer my own question, I was able to work this out this morning with some pretty basic code, see below. All it takes is making a function that actually tests for whole-number values, which is.integer does not (it checks the storage type, not the value).
# Replacement for is.integer, which checks the storage type rather than whether the value is whole
is.wholenumber <- function(x, tol = .Machine$double.eps^0.5) abs(x - round(x)) < tol
# Use ifelse to change only whole number values of lat and long
fish$jitter_lat <- ifelse(is.wholenumber(fish$lat), fish$lat+rnorm(fish$lat, mean=0, sd=0.01), fish$lat)
fish$jitter_long <- ifelse(is.wholenumber(fish$long), fish$long+rnorm(fish$long, mean=0, sd=0.01), fish$long)

What is the most efficient way to find the closest geographic location?

I have two data frames in which observations are geographic locations defined by a latitude/longitude combination. For each point in df1 I would like to get the closest point in df2 and the associated value. I know how to do that by computing all the possible distances (using e.g. the gdist function from the Imap package) and getting the index of the smallest one. But this is prohibitively slow, as df1 has 1,000 rows and df2 some 15 million.
Do you have an idea of how I could reach my goal without computing all the distances? Maybe there is a way to limit the necessary calculations (for instance using the difference in latitude/longitude values)?
Thanks for helping,
Val
Here's what df1 looks like:
Latitude Longitude
1 56.76342 8.320824
2 54.93165 9.115982
3 55.80685 9.102455
4 57.27000 9.760000
5 56.76342 8.320824
6 56.89333 9.684435
7 56.62804 8.571573
8 56.64850 8.501947
9 55.40596 8.884374
10 54.89786 11.880828
then df2:
Latitude Longitude Value
1 41.91000 -4.780000 40500
2 41.61063 14.750832 13500
3 41.91000 -4.780000 4500
4 38.70000 -2.350000 28500
5 52.55172 0.088622 1500
6 39.06000 -1.830000 51000
7 41.91000 -4.780000 49500
8 48.00623 -4.389639 12000
9 56.24889 -3.666940 27000
10 42.72000 -3.750000 49500
Split the second frame into chunks of equal size, then search only the chunks within a reasonable distance of your point. You will basically be drawing a checkerboard on the map. Your point will fall inside one of these squares, so you only need to search that one and a few neighbouring ones to be safe.
A naive brute-force search takes rows(df1) * rows(df2) comparisons; in our case 1000 * 15M, making for 15G operations times the computation time per operation.
So how do we split the data into chunks?
sort by latitude
sort by longitude
take equally spaced chunks
Sorting takes on the order of N*log(N) operations. N is 15M in our case, so these two sorts will take roughly 2 * 15M * log2(15M) ~ 2 * 15M * 24 operations. Splitting into the chunks is then linear, ~15M operations, maybe a few times over.
Once you have this separation done, each chunk holds roughly total_points / chunk_count points, assuming that your points are distributed evenly.
The number of chunks is inversely proportional to the chunk area:
chunk_count ~ total_area / (chunk_side ^ 2).
Ideally you want to balance the number of chunks against the number of points in each chunk so that both are ~ sqrt(points_total).
Each of the thousand searches will now take only about chunk_count + points_in_chunk * 9 operations (if we want to be super safe and search the chunk our point lands in plus all the surrounding ones). So instead of 1000 * 15M you now have roughly 1000 * (sqrt(15M) * 18) ~ 1000 * 70K, an improvement of roughly two hundred times.
Note that this improvement will grow if the second set gets larger. Also, the improvement will be smaller if you choose the chunk size poorly.
For further improvement, you can iterate this once or twice more, making chunks within chunks. The logic is similar.
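A rough R translation of that idea follows; the 1-degree chunk size, the closest_in_chunks helper name, and the fall-back to a full search are my own illustrative choices, not code from the answer:
library(geosphere)

chunkSize <- 1                                    # chunk edge length in degrees; tune as discussed above
df2$chunkLat <- floor(df2$Latitude / chunkSize)   # assign every df2 row to a chunk
df2$chunkLon <- floor(df2$Longitude / chunkSize)

# hypothetical helper: closest df2 row for a single df1 point
closest_in_chunks <- function(lat, lon, df2, chunkSize) {
  cLat <- floor(lat / chunkSize)
  cLon <- floor(lon / chunkSize)
  # candidates: the chunk the query point falls in plus the 8 neighbouring chunks
  cand <- df2[abs(df2$chunkLat - cLat) <= 1 & abs(df2$chunkLon - cLon) <= 1, ]
  if (nrow(cand) == 0) cand <- df2                # fall back to a full search if those chunks are empty
  d <- distm(c(lon, lat), cbind(cand$Longitude, cand$Latitude), fun = distGeo)
  cand[which.min(d), ]
}

# one cheap lookup per df1 row (1,000 small searches instead of 1,000 x 15M distances)
nearest <- do.call(rbind, Map(closest_in_chunks, df1$Latitude, df1$Longitude,
                              MoreArgs = list(df2 = df2, chunkSize = chunkSize)))
Note that points very close to a chunk border can in principle have their true nearest neighbour two chunks away, so pick the chunk size generously relative to typical nearest-neighbour distances.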
The distm function of the geosphere package will help you:
library(geosphere)
library(dplyr)

# Make sure to put longitude first and then latitude:
m1 <- as.matrix(df1 %>% select(Longitude, Latitude))
m2 <- as.matrix(df2 %>% select(Longitude, Latitude))

distm(m1, m2, fun = distGeo)
Remember, the distm function accepts matrix-class objects. For the ten preview rows of each frame shown above, you will obtain a 10 x 10 matrix of distances.
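From there, getting the closest df2 point and its Value for each df1 row is just a row-wise minimum over that matrix. A small follow-on sketch; note that it still computes every pairwise distance, so it works for the previews above but not for 15 million rows:
dmat <- distm(m1, m2, fun = distGeo)   # 10 x 10 for the preview data
nearest <- apply(dmat, 1, which.min)   # index of the closest df2 row for each df1 row
result <- data.frame(df1, ClosestValue = df2$Value[nearest])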

Finding closest point between two vectors based on two dimensions

I have two matrices with large amounts of gps data:
1. User-based GPS data for each user i: ((Latitude_i, Longitude_i), ...) ~ 12 million GPS coordinates
2. Store-based GPS data for each store j: ((Latitude_j, Longitude_j), ...) ~ 15 k GPS coordinates
What I need ultimately is the closest store j (from 2.) for each user i (from 1.).
The brute-force (but computationally infeasible) solution would be to calculate the geographical distance between each user i (from 1.) and each store j (from 2.) and then take the lowest distance.
Since this would result in a 12 million x 15 k distance matrix, and I do not have access to a big-data infrastructure, this is not really workable for me.
So I am looking for smart solutions right now.
What has crossed my mind so far is the idea of finding the numerically closest point between each user i (1.) and each store j (2.)
using apply and which.min(abs(lat_i - lat_j) + abs(long_i + long_j))
and then calculating the geographical distance between these two points.
However, the challenge here is that I need a function that minimizes the overall difference across both dimensions at once, and the above solution doesn't seem to work.
Any help is very much appreciated!!
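One common way to avoid the full 12 million x 15 k distance matrix (not mentioned in the post above) is a k-d tree nearest-neighbour lookup. Converting latitude/longitude to 3-D coordinates on the unit sphere first keeps the result correct, because chord distance is monotone in great-circle distance. A hedged sketch with the RANN package, using toy matrices in place of the real data:
library(RANN)

# toy data standing in for the real user / store matrices (names are illustrative)
users  <- cbind(lat = c(52.52, 48.14), lon = c(13.40, 11.58))
stores <- cbind(lat = c(52.50, 50.11, 48.13), lon = c(13.45, 8.68, 11.57))

# convert degrees to 3-D unit-sphere coordinates
to_xyz <- function(lat, lon) {
  lat <- lat * pi / 180
  lon <- lon * pi / 180
  cbind(cos(lat) * cos(lon), cos(lat) * sin(lon), sin(lat))
}

# k-d tree lookup: index of the closest store for every user
nn <- nn2(data  = to_xyz(stores[, "lat"], stores[, "lon"]),
          query = to_xyz(users[, "lat"],  users[, "lon"]),
          k = 1)
closest_store <- nn$nn.idx[, 1]   # row number of the nearest store for each user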

Averaging different length vectors with same domain range in R

I have a dataset that looks like the one shown in the code.
What I am guaranteed is that the "(var)x" (domain) of the variable is always between 0 and 1. The "(var)y" (co-domain) can vary but is also bounded, but within a larger range.
I am trying to get an average over the "(var)x" domain, but across the different variables.
I would like some kind of selective averaging, but I am not sure how to do this in R.
ax=c(0.11,0.22,0.33,0.44,0.55,0.68,0.89)
ay=c(0.2,0.4,0.5,0.42,0.5,0.43,0.6)
bx=c(0.14,0.23,0.46,0.51,0.78,0.91)
by=c(0.1,0.2,0.52,0.46,0.4,0.41)
qx=c(0.12,0.27,0.36,0.48,0.51,0.76,0.79,0.97)
qy=c(0.03,0.2,0.52,0.4,0.45,0.48,0.61,0.9)
a<-list(ax,ay)
b<-list(bx,by)
q<-list(qx,qy)
What I would like to have is something like
avgd_x = c(0.12,0.27,0.36,0.48,0.51,0.76,0.79,0.97)
and an avgd_y whose first element is the mean of qy at 0.12 together with the values of ay and by evaluated at 0.12, and so forth for all the values in the vector with the largest number of elements.
How can I do this in R ?
P.S: This is a toy dataset, my dataset is spread over files and I am reading them with a custom function, but the raw data is available as shown in the code below.
Edit:
Some clarification:
avgd_y would have the length of the largest vector; in the case above, avgd_y would be (ay' + by' + qy)/3, where ay' and by' are versions of ay and by interpolated at the data points of qx, i.e. ay'[i] = ay(qx[i]) and by'[i] = by(qx[i]) for i from 1 to length(qx).
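For what it is worth, a small base-R sketch of that interpolation idea using approx(); the rule = 2 end-value handling for qx points outside a curve's range is my own assumption:
# interpolate each shorter curve onto the x-grid of the longest vector (qx), then average
avgd_x <- qx
ay_q <- approx(ax, ay, xout = qx, rule = 2)$y   # ay evaluated at the qx points
by_q <- approx(bx, by, xout = qx, rule = 2)$y   # by evaluated at the qx points
avgd_y <- (ay_q + by_q + qy) / 3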

How to select multiple cells in a matrix and perform an operation on corresponding cells in another matrix of the same size?

I am trying to write an R script to do pollution routing in world rivers, and need some help on selecting matrix cell coordinates and applying these to other matrices of the same dimension.
My data: I have several matrices corresponding to hydrological parameters of world rivers on a half degree grid (360 rows, 720 columns). These matrices represent flow accumulation (how many cells flow into this cell), flow direction (which of the 8 surrounding cells does the load of certain cell flow to) and pollutant load.
My idea: compute pollutant load in each grid cell from the start to the end of a river. I can base this on flow accumulation (low to high). However, each river basin can have multiple cells with the same flow accumulation value.
The problem: I need to select all matrix cells with each value of flow accumulation (low to high), find their coordinates (row, column), and transfer the corresponding pollutant load to the correct adjacent cell using the flow direction matrix. I have tried various ways, but I cannot get the selection of the coordinates of the correct cells, and their application to another matrix, to work.
I will give an example of what I have tried, using two for loops on one single river basin. In this example, a flow direction value of 1 means that the pollutant load needs to be transferred to the adjacent cell to the right (row is the same, column +1):
BasinFlowAccumulation        <- FlowAccumulation[Basin]   # (assumes the Basin subset keeps the matrix structure)
BasinFlowAccumulationMaximum <- max(BasinFlowAccumulation)
BasinFlowDirection           <- FlowDirection[Basin]
BasinPollutant               <- Pollutant[Basin]
for (b in 0:BasinFlowAccumulationMaximum) {
  # row/column indices of every cell with the current flow accumulation value
  cells.index <- which(BasinFlowAccumulation == b, arr.ind = TRUE)
  for (j in seq_len(nrow(cells.index))) {
    Row    <- cells.index[j, 1]
    Column <- cells.index[j, 2]
    # flow direction 1: pass this cell's load to the cell on its right (same row, column + 1)
    if (BasinFlowDirection[Row, Column] == 1) {
      BasinPollutant[Row, Column + 1] <- BasinPollutant[Row, Column + 1] + BasinPollutant[Row, Column]
    }
  }
}
Any advice would be greatly appreciated!
