I would like to look at aid effectiveness, by looking at whether 1) the aid projects are placed were the child mortality is the highest and 2) If a respondents geographical proximity to a project reduces child mortality. I have two datasets. Dataset1 contains the latitude and longitude of aid-projects, and dataset 2 contains the latitude and longitude of the respondents (they are put inside clusters in order to not reveal their identity). I would like to set a radius of 25km from each respondent (data2) and see how many projects are within this radius, and the name of these projects. I have seen tips on how to calculate distances within the same dataset (R - Finding closest neighboring point and number of neighbors within a given radius, coordinates lat-long), but not a good tip on how to calculate the distance between two datasets. Does anyone have any insights as to how I might solve this problem? I have some experience with R, but I´m still pretty newbie! The datasets contains 1593 aid-projects and 267 respondent-clusters. Examples of tree longitude/latitude-coordinates are given below:
Dataset 1 - Aid projects - Latitude, longitude (0.75 , 34.08333 ; 2.85604 , 32.43001 ; 1.78481 , 33.59432)
Dataset 2 - Respondents data (clusters)- latitude, longitude (2.34240627, 32.64138 ; 2.18839338, 32.65868; 2.04988464, 32.87160)
Any help would be very much appreciated!!
Best regards,
Synnøve
Related
I have a seemingly simple question that I can’t seem to figure out. I have a large dataset of millions of data points. Each data point represents a single fish with its biological information as well as when and where it was caught. I am running some statistics on these data and have been having issues which I have finally tracked down to some data points having latitude and longitude values that fall exactly on the corners of the grid cells which I am using to bin my data. When these fish with lats and long that fall exactly onto grid cell corners are grouped into their appropriate grid cell, they end up being duplicated 4 times (one for each cell that touches the grid cell corner their lats and long identify).
Needless to say this is bad and I need to force those animals to have lats and long that don’t put them exactly on a grid cell corner. I realize there are probably lots of ways to correct something like this but what I really need is a simply way to identify latitudes and longitudes that have integer values, and then to modify them by a very small amount (randomly adding or subtracting) so as to shift them into a specific cell without creating a bias by shifting them all the same way.
I hope this explanation makes sense. I have included a very simple example in order to provide a workable problem.
fish <- data.frame(fish=1:10, lat=c(25,25,25,25.01,25.2,25.1,25.5,25.7,25,25),
long=c(140,140,140,140.23,140.01,140.44,140.2,140.05,140,140))
In this fish data frame there are 10 fish, each with an associated latitude and longitude. Fish 1, 2, 3, 9, and 10 have integer lat and long values that will place them exactly on the corners of my grid cells. I need some way of shifting just these values by something like plus are minus 0.01.
I can identify which lats or longs are integers easy enough with something like:
fish %>%
near(as.integer(fish$lat))
But am struggling to find a way to then modify all the integer values by some small amount.
To answer my own question I was able to work this out this morning with some pretty basic code, see below. All it takes is making a function that actually looks for whole numbers, where is.integer does not.
# Used to fix the is.integer function to actually work and not just look at syntax
is.wholenumber <- function(x, tol = .Machine$double.eps^0.5) abs(x - round(x)) < tol
# Use ifelse to change only whole number values of lat and long
fish$jitter_lat <- ifelse(is.wholenumber(fish$lat), fish$lat+rnorm(fish$lat, mean=0, sd=0.01), fish$lat)
fish$jitter_long <- ifelse(is.wholenumber(fish$long), fish$long+rnorm(fish$long, mean=0, sd=0.01), fish$long)
I have two matrices with large amounts of gps data:
User Based GPS Data for each user i ((Latitude_i, Longitude_i), ...)) ~ 12 Mio GPS Coordinates
Store Based GPS Data for each store j ((Latitude_j, Longitude_j), ..)) ~ 15 k GPS Coordinates
What I need ultimately is the closest store j (from 2.) for each user i (from 1.).
The brut force (but computationally not feasible) solution would be, to calculate the geographical distance between each user i (from 1.) and each store j from (2.) and then take the lowest distance.
Since this would result in a 12 Mio x 15 k matrix and I do not have access to a Big Data infrastructure, this is not really working for me.
So I am looking for smart solutions right now.
What crossed my mind so far, was the idea of finding the simple numerically closest point between each user i (1.) and each store j (2.)
using apply and which.min(abs(lat_i-lat_j) + abs(long_i + long_j))
and then calculate the geographical distance between these two points.
However, the challenge here is that I need a function that minimizes the overall difference, consisting of two points and the above solution doesnt seem to work.
Any help is very much appreciated!!
I'm currently doing a classification project and the data I'm using includes lat/long attributes. In order to simply the model(s) I'm thinking it might be easier to replace the raw coordinates with a single column of 'grid' numbers.
By this I mean chop-up the area that the coordinates cover into an arbitrary number of grid points, number each square within the grid, and then replace the lat/long figures with the grid number which they fall in. For example, a 9 square grid might look like this:
123
456
789
I've done a fair bit of searching on here and Google and can't seem to find a solution. The closest I can find is the Universal Transverse Mercator coordinate system (which some R packages support), but the squares within this grid are too large. I'd like to be able to set the size of the grid myself.
I'm at a bit of a loss, and was wondering if the kind people of this forum knew of any R packages or techniques to achieve what I'd like. I'll append an example of my lat/long columns. Thanks.
Latitude Longitude
41.95469 -87.800991
41.95469 -87.800991
41.994991 -87.769279
41.974089 -87.824812
41.974089 -87.824812
41.9216 -87.666455
41.891118 -87.654491
41.867108 -87.654224
41.867108 -87.654224
41.896282 -87.655232
41.919343 -87.694259
Not especially elegant, but this works
pos <- data.frame(lat=c(
41.95469,
41.95469,
41.994991,
41.974089,
41.974089,
41.9216,
41.891118,
41.867108,
41.867108,
41.896282,
41.919343),
long=c(
-87.824812,
-87.769279,
-87.800991,
-87.800991,
-87.824812,
-87.666455,
-87.654491,
-87.654224,
-87.654224,
-87.655232,
-87.694259))
gridx <- seq(from=-87.9,to=-87.6,by=0.01)
gridy <- seq(from=41.8,to=42,by=0.01)
xcell <- unlist(lapply(pos$long,function(x) min(which(gridx>x))))
ycell <- unlist(lapply(pos$lat,function(y) min(which(gridy>y))))
pos$cell <- (length(gridx) - 1) * ycell + xcell
I am trying to write an R script to do pollution routing in world rivers, and need some help on selecting matrix cell coordinates and applying these to other matrices of the same dimension.
My data: I have several matrices corresponding to hydrological parameters of world rivers on a half degree grid (360 rows, 720 columns). These matrices represent flow accumulation (how many cells flow into this cell), flow direction (which of the 8 surrounding cells does the load of certain cell flow to) and pollutant load.
My idea: compute pollutant load in each grid cell from the start to the end of a river. I can base this on flow accumulation (low to high). However, each river basin can have multiple cells with the same flow accumulation value.
The problem: I need to select all matrix cells of each value of flow accumulation (low to high), find their coordinates (row,column), and transfer the corresponding pollutant load to the correct adjacent cell using the flow direction matrix. I have tried various ways, but selecting the coordinates of the correct cells and applying these to another matrix I cannot get to work.
I will give an example of what I have tried, using two for loops on one single river basin. In this example, a flow direction value of 1 means that the pollutant load needs to be transferred to the adjacent cell to the right (row is the same, column +1):
BasinFlowAccumulation <-FlowAccumulation[Basin]
BasinFlowAccumulationMaximum <- max(BasinFlowAccumulation)
BasinFlowDirection <-FlowDirection[Basin]
BasinPollutant <-Pollutant[Basin]
b<-0
for(i in 0:BasinFlowAccumulationMaximum){
cells.index<-which(BasinFlowAccumulation[]==b, arr.ind=TRUE)
for (j in 1:length(cells.index)){
print(BasinFlowDirection[cells[j]])
Row<-BasinPollutant[cells[j[1]]]
Column<-BasinPollutant[cells[j[2]]]
ifelse(BasinFlowDirection[cells.index[j]]==1, BasinPollutant[Row,(Column+1)]<-BasinPollutant[Row,(Column+1)]+Basinpollutant[Row,Column]
}
b<-b+1
}
Any advice would be greatly appreciated!
I have a number of fishing boat tracks, and I'm trying to detect a certain pattern in their movement using R. In doing so I have reached a point where I have discarded all points of the track where the desired pattern is not occurring within a given time window, and I'm left with the remaining georeferenced points. These points have a score value associated, which measures the 'intensity' of the desired pattern.
track_1[1:10,]:
LAT LON SCORE
1 32.34855 -35.49264 80.67
2 31.54764 -35.58691 18.14
3 31.38293 -35.25243 46.70
4 31.21447 -35.25830 22.65
5 30.76365 -35.38881 11.93
6 30.75872 -35.54733 22.97
7 30.60261 -35.95472 35.98
8 30.62818 -36.27024 31.09
9 31.35912 -35.73573 14.97
10 31.15218 -36.38027 37.60
The code bellow provides the same data
data.frame(cbind(
LAT=c(32.34855,31.54764,31.38293,31.21447,30.76365,30.75872,30.60261,30.62818,31.35912,31.15218),
LON=c(-35.49264,-35.58691,-35.25243,-35.25830,-35.38881,-35.54733,-35.95472,-36.27024,-35.73573,-36.38027),
SCORE=c(80.67,18.14,46.70,22.65,11.93,22.97,35.98,31.09,14.97,37.60)))
Because some of these points occur geographically close to each other I need to 'pool' their scores together. Hence, I now need a way to throw this data into some kind of a spatial grid and cumulatively sum the scores of all points that fall in the same cell of the grid. This would allow me to find in what areas a given fishing boat exhibits the pattern I'm after the most (and this is not just about time spent in one place). Ultimately, the preferred output would contain lat and lon for every grid cell (center), and the sum of all scores on each cell. In addition, I would also like to be able to adjust the sizing of the grid cells.
I've looked around and all I can find either does not preserve the georeferenced information, is very inefficient, or performs binning of data. There may already be some answers out there, but it might be the case that I'm not able to recognize them since I'm a bit out of my league on this stuff. Can someone please point me to some direction (package, function, etc.)? Any guidance will be greatly appreciated.
Take your lat/lon coordinates, and multiply them by the inverse of your desired grid cell edge lengths, measured in degrees. The result will be a pair of floating point numbers whose integer part identifies the grid cell in question. Take the floor of these and you have two numbers describing the cell, which you could paste to form a single string. You may add that as a new factor column of your data frame. Then you can perform operations based on that factor, like summarizing values.
Example:
latScale <- 2 # one cell for every 0.5 degrees
lonScale <- 2 # likewise
track_1$cell <- factor(with(track_1,
paste(floor(LAT*latScale), floor(LON*lonScale), sep='.')))
library(plyr)
ddply(track_1, .(cell), summarize,
LAT=mean(LAT), LON=mean(LON), SCORE=sum(SCORE))
If you want to, you can use weighted.mean instead of mean. If you don't like these factors, you can put more effort in making them nice (e.g. by using compass directions instead of signs), or drop them altogether and use a pair of integer columns instead.