Last week I asked this question: How to efficiently calculate distance between GPS points in one dataset and GPS points in another dataset using data.table, about matching the GPS coordinates of a GPS tracker to the GPS coordinates of bus stops. User Hugh gave an answer that works many times faster than the solution I eventually came up with. However, there is one problem I cannot fix in his/her code (I feel like I have really tried every permutation of it): I need a rolling join in which every bus stop near a GPS point gets listed. The current code returns only the nearest stop, but a GPS coordinate can of course be near multiple bus stops. I know this means I have to swap the X and the Y in the rolling join somewhere, but somehow it just doesn't work. Can someone maybe point me towards a solution?
All the code and full explanation can be found in the original question, but this is how I create the data:
# create GPS data
number_of_GPS_coordinates <- 1000000
set.seed(1)
gpsdata<-data.frame(id=1:number_of_GPS_coordinates,
device_id = 1,
latitude=runif(number_of_GPS_coordinates,50.5,53.5),
longitude=runif(number_of_GPS_coordinates,4,7))
# create Bus stop data
set.seed(1)
number_of_bus_stops <- 55000
busdata<-data.frame(id=1:number_of_bus_stops,
name=replicate(number_of_bus_stops, paste(sample(LETTERS, 15, replace=TRUE), collapse="")),
latitude=runif(number_of_bus_stops,50.5,53.5),
longitude=runif(number_of_bus_stops,4,7))
And this is the function that does work for me but is still not very fast (though much faster than my first attempt):
check_if_close <- function(dataset1 = GPS.Utrecht.to.Gouda,
dataset2 = bus_stops,
n.splits = 500,
desired.dist = .2){
# dataset1 needs at least the columns
# - "id",
# - "device_id"
# - "latitude"
# - "longitude"
# dataset2 needs at least the columns
# - "id",
# - "name"
# - "latitude"
# - "longitude"
# these are the average coordinates of the Netherlands. A change of .0017 in latitude leads to a change of 189 meters
# spDistsN1(matrix(c(5.2913, 52.1326), ncol=2), matrix(c(5.2913, 52.1326+.0017), ncol=2), longlat=TRUE)*1000
# [1] 189.1604
# this means the latitude slices we cut (the subsection of) the Netherlands into have to be at least .0017 wide.
# if we look at the Netherlands as a whole, this means we can use at most (53.5-50.5)/.0017 = 1765 slices.
# if we look only at a small subsection (because we are looking at a single trip, for example) we need far fewer slices.
# 1) we only select the variables we need from dataset 1
dataset1 <- setDT(dataset1)[,c("id", "device_id", "latitude", "longitude")]
setnames(dataset1, old = c("id", "latitude", "longitude"), new = c("id_dataset1", "latitude_gps", "longitude_gps"))
# 2) we only select the variables we need from dataset 2
dataset2 <- setDT(dataset2)[,c("id", "name", "latitude", "longitude")]
setnames(dataset2, old = c("id", "latitude", "longitude"), new = c("id_dataset2", "latitude_feature", "longitude_feature"))
# 3) only keep the subset of dataset2 that falls within dataset1.
# There is no reason to check features that already fall outside the GPS coordinates of the trip we want to check.
# We do add a 0.01 point margin around it to be on the safe side. A feature may fall just outside the GPS coordinates,
# but still be near a GPS point
dataset2 <- dataset2[latitude_feature %between% (range(dataset1$latitude_gps) + c(-0.01, +0.01))
& longitude_feature %between% (range(dataset1$longitude_gps) + c(-0.01, +0.01)), ]
# 4) we cut the dataset2 into slices on the latitude dimension
# some trial and error is involved in getting the right number. if you use too many you get a large and redundant number of empty values
# if you use too few you need to check too many GPS-to-feature distances per slice
dataset2[, range2 := as.numeric(Hmisc::cut2(dataset2$latitude_feature, g=n.splits))]
# 5) calculate the ranges of the slices we just created
ranges <- dataset2[,list(Min=min(latitude_feature), Max= max(latitude_feature)), by=range2][order(range2)]
setnames(ranges, old = c("range2", "Min", "Max"), new = c("latitude_range", "start", "end"))
# 6) now we assign to which slice every GPS coordinate in dataset1 belongs
# this is super fast when using data.table grammar
elements1 <- dataset1$latitude_gps
ranges <- setDT(ranges)[data.table(elements1), on = .(start <= elements1, end >=elements1)]
ranges[, rowID := seq_len(.N)]
dataset1[,rowID := seq_len(.N)]
setkey(dataset1, rowID)
setkey(ranges, rowID)
dataset1<-dataset1[ranges]
# 7) this is the actual function we use to check if a datapoint is nearby.
# potentially there are faster functions to do this??
checkdatapoint <- function(p, h, dist=desired.dist) {
distances <- spDistsN1(data.matrix(filter(dataset1,latitude_range==h)[,c("longitude_gps","latitude_gps")]),
p,
longlat=TRUE) # in km
return(which(distances <= dist)) # distance is now set to 200 meters
}
# 8) we assign an ID to dataset1, restarting at every slice.
# we need this to later match the data again
dataset1[, ID2 := sequence(.N), by = latitude_range]
# 9) here we loop over all the splits and for every point check if there is a feature nearby in the slice it falls in
# to be on the safe side we also check the slices left and right of it, just to make sure we do not miss features that
# are nearby, but just fall in a different slice.
# 9a: create an empty list we fill with dataframes later
TT<-vector("list", length=n.splits)
# 9b: loop over the number of slices using above defined function
for(i in 1:n.splits){
datapoints.near.feature<-apply(data.matrix(dataset2[range2 %in% c(i-1,i, i+1), c("longitude_feature","latitude_feature")]), 1, checkdatapoint, h=i)
# 9c: if in that slice there was no match between a GPS coordinate and a nearby feature, we leave the list entry empty
if (!is.list(datapoints.near.feature)) { # apply() returns a list only when there were matches; avoids length>1 class() comparisons on matrices
TT[[i]] <-NULL
} else {
# 9d: if there was a match we get a named list of data points
names(datapoints.near.feature) <- dataset2[range2 %in% c(i-1,i, i+1), name]
# 9e: then we 'melt' this list into a data.frame
temp <- melt(datapoints.near.feature)
# 9f: then we transform it into a data.table and change the names
setDT(temp)
setnames(temp, old = "L1", new = "feature_name")
# 9h: then we select only the data points in dataset1 that fall in the current slice, give them an
# ID and merge them with the file of nearby bus stops
gpsdata.f <- dataset1[latitude_range==i, ]
gpsdata.f[, rowID2 := seq_len(.N)]
setkey(gpsdata.f, rowID2)
setkey(temp, value)
GPS.joined.temp <- merge(x = gpsdata.f, y = temp, by.x= "rowID2", by.y= "value", all.x=TRUE)
# 9i: we only keep the unique entries and for every slice save them to the list
GPS.joined.unique.temp <- unique(GPS.joined.temp, by=c("id_dataset1", "feature_name"))
TT[[i]] <- GPS.joined.unique.temp
cat(paste0(round(i/n.splits*100), '% completed'), " \r"); flush.console()
#cat(i/n.splits*100, " \r"); flush.console()
}
}
# 10) now we left join the original dataset and the data points that are near a feature
finallist<- merge(x = dataset1,
y = rbindlist(TT[vapply(TT, Negate(is.null), NA)]),
by.x= "id_dataset1",
by.y= "id_dataset1",
all.x=TRUE)
# 11) we add a new logical variable to check if any bus stop is near
finallist[, nearby := TRUE][is.na(feature_name), nearby := FALSE] # add a dummy to check if any bus stop is nearby.
# 12) if a point is near multiple features at once these are listed in a vector,
# instead of having duplicate rows with the same id but different features
finallist <- unique(setDT(finallist)[order(id_dataset1, feature_name), list(feature_name=list(feature_name), id=id_dataset1, lat=latitude_gps.x, lon=longitude_gps.x, nearby=nearby), by=id_dataset1], by="id_dataset1")
return(finallist)
}
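For reference, with the sample data created above a call would look like this (a sketch; besides data.table, the function body also relies on sp::spDistsN1, dplyr::filter, Hmisc::cut2, and melt() from reshape2, so those packages must be available):
library(data.table)
library(dplyr)   # for filter() inside the function
library(sp)      # for spDistsN1() inside the function
result <- check_if_close(dataset1 = gpsdata, dataset2 = busdata,
                         n.splits = 500, desired.dist = 0.2)
head(result)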
This is the code @Hugh proposed, which is super fast and almost does the same thing, except that it lists only the nearest bus stop for each point rather than all nearby stops.
# create GPS data
number_of_GPS_coordinates <- 20000
set.seed(1)
gpsdata<-as.data.frame(cbind(id=1:number_of_GPS_coordinates,
lat=runif(number_of_GPS_coordinates,50.5,53.5),
lon=runif(number_of_GPS_coordinates,4,7)))
# create some busstop data. In this case only 2000 bus stops
set.seed(1)
number_of_bus_stops <- 2000
stop<-as.data.frame(gpsdata[sample(nrow(gpsdata), number_of_bus_stops), -1]) # of course do not keep id variable
stop$lat<-stop$lat+rnorm(number_of_bus_stops,0,.0005)
stop$lon<-stop$lon+rnorm(number_of_bus_stops,0,.0005)
busdata.data<-cbind(stop, name=replicate(number_of_bus_stops, paste(sample(LETTERS, 15, replace=TRUE), collapse="")))
names(busdata.data) <- c("latitude_bustops", "longitude_bustops", "name")
library(data.table)
library(hutils)
setDT(gpsdata)
setDT(busdata.data)
gps_orig <- copy(gpsdata)
busdata.orig <- copy(busdata.data)
setkey(gpsdata, lat)
# Just to take note of the originals
gpsdata[, gps_lat := lat + 0]
gpsdata[, gps_lon := lon + 0]
busdata.data[, lat := latitude_bustops + 0]
busdata.data[, lon := longitude_bustops + 0]
setkey(busdata.data, lat)
gpsID_by_lat <-
gpsdata[, .(id), keyby = "lat"]
By_latitude <-
busdata.data[gpsdata,
on = "lat",
# within 0.5 degrees of latitude
roll = 0.5,
# +/-
rollends = c(TRUE, TRUE),
# and remove those beyond 0.5 degrees
nomatch=0L] %>%
.[, .(id_lat = id,
name_lat = name,
bus_lat = latitude_bustops,
bus_lon = longitude_bustops,
gps_lat,
gps_lon),
keyby = .(lon = gps_lon)]
setkey(busdata.data, lon)
By_latlon <-
busdata.data[By_latitude,
on = c("name==name_lat", "lon"),
# within 0.5 degrees of longitude
roll = 0.5,
# +/-
rollends = c(TRUE, TRUE),
# and remove those beyond 0.5 degrees
nomatch=0L]
By_latlon[, distance := haversine_distance(lat1 = gps_lat,
lon1 = gps_lon,
lat2 = bus_lat,
lon2 = bus_lon)]
By_latlon[distance < 0.2]
Can someone help me adjust the code in such a way that not only the nearest bus stop but all stops within 200 meters are given (in a list, or as separate rows)?
I am facing a coding (optimization) problem in R. I have a long data set with GPS coordinates (lon, lat, timestamp) and for every row I need to check whether the location is near a bus stop. I have a .csv file with all the bus stops (in the Netherlands). The GPS coordinates file is millions of entries long, but could be split if necessary. The bus stop dataset is around 5500 entries long.
Using the code and tips given on, inter alia, these pages:
1) How to efficiently calculate distance between pair of coordinates using data.table :=
2) Using a simple for loop on spatial data
3) Calculate distance between two latitude-longitude points? (Haversine formula)
4) Fastest way to determine COUNTRY from millions of GPS coordinates [R]
I was able to construct code that works, but is (too) slow. I was wondering if someone can help me with a faster data.table implementation, or can point out where the bottleneck in my code is? Is it the spDistsN1() function, or maybe the combination of apply() and melt()? I am most comfortable in R, but open to other software (as long as it is open source).
Due to privacy concerns I cannot upload the full dataset, but this is a (small) reproducible example that is not too different from how the real data looks.
# packages:
library(data.table)
library(tidyverse)
library(sp)
# create GPS data
number_of_GPS_coordinates <- 20000
set.seed(1)
gpsdata<-as.data.frame(cbind(id=1:number_of_GPS_coordinates,
lat=runif(number_of_GPS_coordinates,50.5,53.5),
lon=runif(number_of_GPS_coordinates,4,7)))
# create some busstop data. In this case only 2000 bus stops
set.seed(1)
number_of_bus_stops <- 2000
stop<-as.data.frame(gpsdata[sample(nrow(gpsdata), number_of_bus_stops), -1]) # of course do not keep id variable
stop$lat<-stop$lat+rnorm(number_of_bus_stops,0,.0005)
stop$lon<-stop$lon+rnorm(number_of_bus_stops,0,.0005)
busdata.data<-cbind(stop, name=replicate(number_of_bus_stops, paste(sample(LETTERS, 15, replace=TRUE), collapse="")))
names(busdata.data) <- c("latitude_bustops", "longitude_bustops", "name")
You can download the real bus stop data if you want; it is kind of hard to reproduce a random sample of it.
#temp <- tempfile()
#download.file("http://data.openov.nl/haltes/stops.csv.gz", temp) #1.7MB
#gzfile(temp, 'rt')
#busstopdata <- read.csv(temp, stringsAsFactors = FALSE)
#unlink(temp)
#bus_stops <- fread("bus_stops.csv")
#busdata.data <- busstopdata %>%
# mutate(latitude_bustops = latitude)%>%
# mutate(longitude_bustops = longitude)%>%
# dplyr::select(name, latitude_bustops, longitude_bustops)
This is the code I use now to calculate distances. It works, but it is pretty slow:
countDataPoints3 <- function(p) {
distances <- spDistsN1(data.matrix(gpsdata[,c("lon","lat")]),
p,
longlat=TRUE) # in km
return(which(distances <= .2)) # distance is now set to 200 meters
}
# code to check per data point if a bus stop is near and save this per bus stop in a list entry
datapoints.by.bustation <- apply(data.matrix(busdata.data[,c("longitude_bustops","latitude_bustops")]), 1, countDataPoints3)
# rename list entries
names(datapoints.by.bustation) <- busdata.data$name
# melt list into one big data.frame
long.data.frame.busstops <- melt(datapoints.by.bustation)
# now switch to data.table grammar to speed up process
# set data.table
setDT(gpsdata)
gpsdata[, rowID := 1:nrow(gpsdata)]
setkey(gpsdata, rowID)
setDT(long.data.frame.busstops)
# merge the data, and filter non-unique entries
setkey(long.data.frame.busstops, value)
GPS.joined <- merge(x = gpsdata, y = long.data.frame.busstops, by.x= "rowID", by.y= "value", all.x=TRUE)
GPS.joined.unique <- unique(GPS.joined, by="id") # mak
# this last part of the code is needed to make sure that if there is more than one bus stop nearby it puts these bus stops in a list
# instead of adding rows and making the final data.frame longer than the original one
GPS.joined.unique2 <- setDT(GPS.joined.unique)[order(id, L1), list(L1=list(L1)), by=id]
GPS.joined.unique2[, nearby := TRUE][is.na(L1), nearby := FALSE] # add a dummy to check if any bus stop is nearby.
# makes sense:
as_tibble(GPS.joined.unique2) %>%
summarize(sum = sum(nearby))
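To pin down the bottleneck asked about above, one option (a sketch using base R's profiler, not part of the original post) is to profile the distance step on its own:
# Rprof() samples the call stack; summaryRprof() attributes time by function
Rprof(prof_file <- tempfile())
datapoints.by.bustation <- apply(
  data.matrix(busdata.data[, c("longitude_bustops", "latitude_bustops")]),
  1, countDataPoints3)
Rprof(NULL)
head(summaryRprof(prof_file)$by.self)  # shows which functions dominate the run time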
Consider a slicing method: first cut by close latitudes and close longitudes; in this case within 0.5 degrees of latitude and 0.5 degrees of longitude (which is still about a 60 km disc). We can use data.table's superb support for rolling joins.
The following takes a few milliseconds for 20,000 entries and only a few seconds for 2M entries.
library(data.table)
library(hutils)
setDT(gpsdata)
setDT(busdata.data)
gps_orig <- copy(gpsdata)
busdata.orig <- copy(busdata.data)
setkey(gpsdata, lat)
# Just to take note of the originals
gpsdata[, gps_lat := lat + 0]
gpsdata[, gps_lon := lon + 0]
busdata.data[, lat := latitude_bustops + 0]
busdata.data[, lon := longitude_bustops + 0]
setkey(busdata.data, lat)
gpsID_by_lat <-
gpsdata[, .(id), keyby = "lat"]
By_latitude <-
busdata.data[gpsdata,
on = "lat",
# within 0.5 degrees of latitude
roll = 0.5,
# +/-
rollends = c(TRUE, TRUE),
# and remove those beyond 0.5 degrees
nomatch=0L] %>%
.[, .(id_lat = id,
name_lat = name,
bus_lat = latitude_bustops,
bus_lon = longitude_bustops,
gps_lat,
gps_lon),
keyby = .(lon = gps_lon)]
setkey(busdata.data, lon)
By_latlon <-
busdata.data[By_latitude,
on = c("name==name_lat", "lon"),
# within 0.5 degrees of longitude
roll = 0.5,
# +/-
rollends = c(TRUE, TRUE),
# and remove those beyond 0.5 degrees
nomatch=0L]
By_latlon[, distance := haversine_distance(lat1 = gps_lat,
lon1 = gps_lon,
lat2 = bus_lat,
lon2 = bus_lon)]
By_latlon[distance < 0.2]
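To answer the follow-up ask above (every stop within 200 m, not just the nearest), one possible tweak is to replace the two rolling joins with a single non-equi join on a small bounding box, then filter exactly by haversine distance. This is a hedged sketch, not part of the original answer; lat_eps, lon_eps, and candidates are illustrative names:
library(data.table)
library(hutils)
# ~200 m expressed in degrees: 1 degree of latitude is ~110.574 km everywhere;
# a degree of longitude shrinks with cos(latitude), so take the widest box the
# data needs (latitudes here are at most 53.5)
lat_eps <- 0.2 / 110.574
lon_eps <- 0.2 / (111.320 * cos(53.5 * pi / 180))
busdata.data[, `:=`(lat_lo = lat - lat_eps, lat_hi = lat + lat_eps,
                    lon_lo = lon - lon_eps, lon_hi = lon + lon_eps)]
# non-equi join: each GPS point matches *every* stop whose box contains it
candidates <- busdata.data[gpsdata,
                           on = .(lat_lo <= gps_lat, lat_hi >= gps_lat,
                                  lon_lo <= gps_lon, lon_hi >= gps_lon),
                           nomatch = 0L,
                           .(id = i.id, name,
                             gps_lat = i.gps_lat, gps_lon = i.gps_lon,
                             bus_lat = latitude_bustops,
                             bus_lon = longitude_bustops)]
# exact distance check; a point near several stops keeps several rows
candidates[, distance := haversine_distance(lat1 = gps_lat, lon1 = gps_lon,
                                            lat2 = bus_lat, lon2 = bus_lon)]
candidates[distance < 0.2]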
This is the function I came up with so far. @Dave2e, thanks. It is already an awful lot faster than what I had. There is still clearly room for a lot of improvement, but as it stands it is fast enough for my analysis now. I only slice by latitude and not by longitude. The only reason for that is that it makes indexing and then looping over indices really easy, but more speed could be gained by also indexing on longitude. Also, real GPS data tend to contain many duplicate values (same lon/lat, different timestamp), so the code would be more efficient if it took this into account. Maybe I will work on that in the future.
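For instance, a hedged sketch of that deduplication idea (names illustrative; the function itself follows unchanged below):
library(data.table)
setDT(gpsdata)
# collapse repeated coordinates before the expensive distance checks
coords <- unique(gpsdata[, .(latitude, longitude)])
coords[, `:=`(id = .I, device_id = 1L)]  # the columns check_if_close() expects
# run the function on the (much smaller) unique coordinates only ...
# near <- check_if_close(dataset1 = coords, dataset2 = bus_stops)
# ... then join the result back onto every timestamped row, e.g.
# gpsdata <- near[gpsdata, on = .(lat = latitude, lon = longitude)]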
# this app could be much faster if it would filter by duplicate GPS coordinates
check_if_close <- function(dataset1 = GPS.Utrecht.to.Gouda,
dataset2 = bus_stops,
n.splits = 500,
desired.dist = .2){
# dataset1 needs at least the columns
# - "id",
# - "device_id"
# - "latitude"
# - "longitude"
# dataset2 needs at least the columns
# - "id",
# - "name"
# - "latitude"
# - "longitude"
# these are the average coordinates of the Netherlands. A change of .0017 in latitude leads to a change of 189 meters
# spDistsN1(matrix(c(5.2913, 52.1326), ncol=2), matrix(c(5.2913, 52.1326+.0017), ncol=2), longlat=TRUE)*1000
# [1] 189.1604
# this means the latitude slices we cut (the subsection of) the Netherlands into have to be at least .0017 wide.
# if we look at the Netherlands as a whole, this means we can use at most (53.5-50.5)/.0017 = 1765 slices.
# if we look only at a small subsection (because we are looking at a single trip, for example) we need far fewer slices.
# 1) we only select the variables we need from dataset 1
dataset1 <- setDT(dataset1)[,c("id", "device_id", "latitude", "longitude")]
setnames(dataset1, old = c("id", "latitude", "longitude"), new = c("id_dataset1", "latitude_gps", "longitude_gps"))
# 2) we only select the variables we need from dataset 2
dataset2 <- setDT(dataset2)[,c("id", "name", "latitude", "longitude")]
setnames(dataset2, old = c("id", "latitude", "longitude"), new = c("id_dataset2", "latitude_feature", "longitude_feature"))
# 3) only keep the subset of dataset2 that falls within dataset1.
# There is no reason to check features that already fall outside the GPS coordinates of the trip we want to check.
# We do add a 0.01 point margin around it to be on the safe side. A feature may fall just outside the GPS coordinates,
# but still be near a GPS point
dataset2 <- dataset2[latitude_feature %between% (range(dataset1$latitude_gps) + c(-0.01, +0.01))
& longitude_feature %between% (range(dataset1$longitude_gps) + c(-0.01, +0.01)), ]
# 4) we cut the dataset2 into slices on the latitude dimension
# some trial and error is involved in getting the right number. if you use too many you get a large and redundant number of empty values
# if you use too few you need to check too many GPS-to-feature distances per slice
dataset2[, range2 := as.numeric(Hmisc::cut2(dataset2$latitude_feature, g=n.splits))]
# 5) calculate the ranges of the slices we just created
ranges <- dataset2[,list(Min=min(latitude_feature), Max= max(latitude_feature)), by=range2][order(range2)]
setnames(ranges, old = c("range2", "Min", "Max"), new = c("latitude_range", "start", "end"))
# 6) now we assign to which slice every GPS coordinate in dataset1 belongs
# this is super fast when using data.table grammar
elements1 <- dataset1$latitude_gps
ranges <- setDT(ranges)[data.table(elements1), on = .(start <= elements1, end >=elements1)]
ranges[, rowID := seq_len(.N)]
dataset1[,rowID := seq_len(.N)]
setkey(dataset1, rowID)
setkey(ranges, rowID)
dataset1<-dataset1[ranges]
# 7) this is the actual function we use to check if a datapoint is nearby.
# potentially there are faster functions to do this??
checkdatapoint <- function(p, h, dist=desired.dist) {
distances <- spDistsN1(data.matrix(filter(dataset1,latitude_range==h)[,c("longitude_gps","latitude_gps")]),
p,
longlat=TRUE) # in km
return(which(distances <= dist)) # distance is now set to 200 meters
}
# 8) we assign an ID to dataset1, restarting at every slice.
# we need this to later match the data again
dataset1[, ID2 := sequence(.N), by = latitude_range]
# 9) here we loop over all the splits and for every point check if there is a feature nearby in the slice it falls in
# to be on the safe side we also check the slices left and right of it, just to make sure we do not miss features that
# are nearby, but just fall in a different slice.
# 9a: create an empty list we fill with dataframes later
TT<-vector("list", length=n.splits)
# 9b: loop over the number of slices using above defined function
for(i in 1:n.splits){
datapoints.near.feature<-apply(data.matrix(dataset2[range2 %in% c(i-1,i, i+1), c("longitude_feature","latitude_feature")]), 1, checkdatapoint, h=i)
# 9c: if in that slice there was no match between a GPS coordinate and a nearby feature, we leave the list entry empty
if (!is.list(datapoints.near.feature)) { # apply() returns a list only when there were matches; avoids length>1 class() comparisons on matrices
TT[[i]] <-NULL
} else {
# 9d: if there was a match we get a named list of data points
names(datapoints.near.feature) <- dataset2[range2 %in% c(i-1,i, i+1), name]
# 9e: then we 'melt' this list into a data.frame
temp <- melt(datapoints.near.feature)
# 9f: then we transform it into a data.table and change the names
setDT(temp)
setnames(temp, old = "L1", new = "feature_name")
# 9h: then we select only the data points in dataset1 that fall in the current slice, give them an
# ID and merge them with the file of nearby bus stops
gpsdata.f <- dataset1[latitude_range==i, ]
gpsdata.f[, rowID2 := seq_len(.N)]
setkey(gpsdata.f, rowID2)
setkey(temp, value)
GPS.joined.temp <- merge(x = gpsdata.f, y = temp, by.x= "rowID2", by.y= "value", all.x=TRUE)
# 9i: we only keep the unique entries and for every slice save them to the list
GPS.joined.unique.temp <- unique(GPS.joined.temp, by=c("id_dataset1", "feature_name"))
TT[[i]] <- GPS.joined.unique.temp
cat(paste0(round(i/n.splits*100), '% completed'), " \r"); flush.console()
#cat(i/n.splits*100, " \r"); flush.console()
}
}
# 10) now we left join the original dataset and the data points that are near a feature
finallist<- merge(x = dataset1,
y = rbindlist(TT[vapply(TT, Negate(is.null), NA)]),
by.x= "id_dataset1",
by.y= "id_dataset1",
all.x=TRUE)
# 11) we add a new logical variable to check if any bus stop is near
finallist[, nearby := TRUE][is.na(feature_name), nearby := FALSE] # add a dummy to check if any bus stop is nearby.
# 12) if a point is near multiple features at once these are listed in a vector,
# instead of having duplicate rows with the same id but different features
finallist <- unique(setDT(finallist)[order(id_dataset1, feature_name), list(feature_name=list(feature_name), id=id_dataset1, lat=latitude_gps.x, lon=longitude_gps.x, nearby=nearby), by=id_dataset1], by="id_dataset1")
return(finallist)
}
I have a list of co-ordinates of certain bus stops in this format
Bus_Stop_ID lat long
A -34.04199 18.61747
B -33.92312 18.44649
I then have a list of certain shops
Shop_ID lat long
1 -34.039350 18.617964
2 -33.927820 18.410520
I would like to check whether each shop falls within a 500 metre radius of a bus stop. Ultimately, the final dataset would look something like this, where the Bus_Stop column indicates T/F and Bus_ID shows the relevant bus ID(s) for that shop if Bus_Stop == T:
Shop_ID lat long Bus_Stop Bus_ID
1 -34.039350 18.617964 TRUE A
2 -33.927820 18.410520 FALSE NA
Does anyone have an idea about how I can go about this using R? I've seen the package geosphere but have struggled to understand it given my relative inexperience in the spatial domain. Any ideas or packages you could recommend? Thank you
Update: a more scalable solution
The previous answer (still included below) is not suited for large data sets. The reason is that we need to compute the distance for each pair of shop and bus stop, so both the memory and the computation scale as O(N*M) for N shops and M bus stops. A more scalable solution uses a data structure such as a KD-tree to perform a nearest neighbor search for each shop. The advantage is that the computational complexity becomes O(M log M) for building the KD-tree over the bus stops and O(N log M) for searching the nearest neighbor of each shop.
To do this, we can use nn2 from the RANN package. The complication is that nn2 deals only with Euclidean distances and knows nothing about lat/long. Therefore, we need to convert the lat/long coordinates to a map projection (e.g., UTM) in order to compute Euclidean distances between shops and bus stops correctly.
Note: The following borrows heavily from Josh O'Brien's solutions for determining the UTM zone from a longitude and for converting lat/long to UTM, so he should take a bow.
## First define a function from Josh OBrien's answer to convert
## a longitude to its UTM zone
long2UTM <- function(long) {
(floor((long + 180)/6) %% 60) + 1
}
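## Quick sanity check of long2UTM (an added aside, not in the original
## answer): the sample data's longitude of ~18.6 E falls in UTM zone 34
stopifnot(long2UTM(18.617964) == 34)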
## Assuming that all points are within a zone (within 6 degrees in longitude),
## we use the first shop's longitude to get the zone.
z <- long2UTM(shops[1,"long"])
library(sp)
library(rgdal)
## convert the bus lat/long coordinates to UTM for the computed zone
## using the other Josh O'Brien linked answer
bus2 <- bus
coordinates(bus2) <- c("long", "lat")
proj4string(bus2) <- CRS("+proj=longlat +datum=WGS84")
bus.xy <- spTransform(bus2, CRS(paste0("+proj=utm +zone=", z, " +ellps=WGS84")))
## convert the shops lat/long coordinates to UTM for the computed zone
shops2 <- shops
coordinates(shops2) <- c("long", "lat")
proj4string(shops2) <- CRS("+proj=longlat +datum=WGS84")
shops.xy <- spTransform(shops2, CRS(paste0("+proj=utm +zone=", z, " +ellps=WGS84")))
library(RANN)
## find the nearest neighbor in bus.xy@coords for each shops.xy@coords
res <- nn2(bus.xy@coords, shops.xy@coords, 1)
## res$nn.dists is a vector of the distance to the nearest bus.xy@coords for each shops.xy@coords
## res$nn.idx is a vector of indices into bus.xy of the nearest bus.xy@coords for each shops.xy@coords
shops$Bus_Stop <- res$nn.dists <= 500
shops$Bus_ID <- ifelse(res$nn.dists <= 500, as.character(bus$Bus_Stop_ID)[res$nn.idx], NA) # as.character avoids ifelse() dropping factor levels
Although more complicated, this approach is much better suited for realistic problems where you may have large numbers of shops and bus stops. Using the same supplied data:
print(shops)
## Shop_ID lat long Bus_Stop Bus_ID
##1 1 -34.03935 18.61796 TRUE A
##2 2 -33.92782 18.41052 FALSE <NA>
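A side note (an addition beyond the original answer): nn2 with k = 1 returns only the single nearest stop. If several stops could fall within 500 m and you want all of them, RANN also supports a fixed-radius search, where k caps how many neighbours can be returned per shop:
## up to k stops within 500 m of each shop; empty slots come back as index 0
res_r <- nn2(bus.xy@coords, shops.xy@coords,
             k = min(10, nrow(bus)),
             searchtype = "radius", radius = 500)
res_r$nn.idx  # one row per shop; zeros where fewer than k stops are in range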
You can do this using the package geosphere. Here, I'm assuming that your first data frame is named bus, and your second data frame is named shops:
library(geosphere)
g <- expand.grid(1:nrow(shops), 1:nrow(bus))
d <- matrix(distGeo(shops[g[,1],c("long","lat")], bus[g[,2],c("long","lat")]),
nrow=nrow(shops))
shops$Bus_Stop <- apply(d, 1, function(x) any(x <= 500))
shops$Bus_ID <- bus[apply(d, 1, function(x) {
  ix <- which(x <= 500)
  if (length(ix) == 0) NA else ix[1]
}), "Bus_Stop_ID"]
print(shops)
## Shop_ID lat long Bus_Stop Bus_ID
##1 1 -34.03935 18.61796 TRUE A
##2 2 -33.92782 18.41052 FALSE <NA>
Notes:
We first use expand.grid to enumerate all pair combinations of shops and bus stops. These are ordered by shops first.
We then compute the distance matrix d using geosphere::distGeo. Note that the input expects (lon, lat) coordinates and that distGeo returns distances in meters. The resulting d matrix is nrow(shops) by nrow(bus), so each row gives the distances from one shop to every bus stop.
We then see if there is a bus stop within 500 meters of each shop by applying the function any(x <= 500) for each row x in d using apply with MARGIN=1.
Similarly, we can extract the column index of d (corresponding to a row in bus) of the first bus stop that is within 500 meters, by using which instead of any in the applied function. We then use this result to select the Bus_Stop_ID from bus.
By the way, we don't have to apply the condition x <= 500 twice. The following will also work:
shops$Bus_ID <- bus[apply(d, 1, function(x) {
  ix <- which(x <= 500)
  if (length(ix) == 0) NA else ix[1]
}), "Bus_Stop_ID"]
shops$Bus_Stop <- !is.na(shops$Bus_ID)
and is more efficient.
Data:
bus <- structure(list(Bus_Stop_ID = structure(1:2, .Label = c("A", "B"
), class = "factor"), lat = c(-34.04199, -33.92312), long = c(18.61747,
18.44649)), .Names = c("Bus_Stop_ID", "lat", "long"), class = "data.frame", row.names = c(NA,
-2L))
shops <- structure(list(Shop_ID = 1:2, lat = c(-34.03935, -33.92782),
long = c(18.617964, 18.41052), Bus_ID = structure(c(1L, NA
), .Label = c("A", "B"), class = "factor"), Bus_Stop = c(TRUE,
FALSE)), .Names = c("Shop_ID", "lat", "long", "Bus_ID", "Bus_Stop"
), row.names = c(NA, -2L), class = "data.frame")
My first approach would be to just use a Euclidean-style distance and check whether the resulting value stays within the 500 m radius.
You could then use an IF clause to fill in the T/F column.
I hope this helps.
PS: Over a distance of 500 m the Earth's surface is essentially flat, so I don't think you need any geoid packages.
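A minimal sketch of that flat-surface idea (an equirectangular approximation; flat_dist_m and the threshold comparison are illustrative choices, using the bus and shops data frames defined in the accepted answer):
# metres per degree: ~110574 for latitude; longitude shrinks with cos(latitude)
flat_dist_m <- function(lat1, lon1, lat2, lon2) {
  dy <- (lat2 - lat1) * 110574
  dx <- (lon2 - lon1) * 111320 * cos(lat1 * pi / 180)
  sqrt(dx^2 + dy^2)
}
# distance matrix: rows are shops, columns are bus stops
d_flat <- outer(seq_len(nrow(shops)), seq_len(nrow(bus)),
                function(i, j) flat_dist_m(shops$lat[i], shops$long[i],
                                           bus$lat[j], bus$long[j]))
shops$Bus_Stop <- apply(d_flat <= 500, 1, any)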
I have 2 dataframes with thousands of data points:
lat lon v1
41.57 -88.11 11
41.58 -88.12 12
42.57 -89.11 55
41.55 -88.31 12
lat lon v2
41.57 -88.41 77
41.58 -88.12 56
42.57 -89.11 73
41.55 -88.61 14
And I want to:
Read each Lat-Long combination in Dataframe 1
Create a square bounding box
Search all the Lat-Long combinations in that bounding box in Dataframe 2
Find a mean of v2 in that bounding box and add it to corresponding row in Dataframe 1
I have got this far, writing a function for a single (latval, lonval):
library(geosphere)
spatialmean<-function(latval,lonval,distance){
coords <- c(lonval, latval)
ne.coords <- c(destPoint(p = coords, b = 90, d = distance)[1],
destPoint(p = coords, b = 0, d = distance)[2])
sw.coords <- c(destPoint(p = coords, b = 90, d = -distance)[1],
destPoint(p = coords, b = 0, d = -distance)[2])
lon1<<-sw.coords[1] #Lon of the left side of the box
lat1<<-sw.coords[2] #Lat of the bottom side of the box
lon2<<-ne.coords[1] #Lon of the right side of the box
lat2<<-ne.coords[2] #Lat of the top side of the box
df2temp<<-subset(df2, lon>=lon1 & lon<=lon2 & lat>=lat1 & lat<=lat2)
meantemp<<-mean(df2temp$v2) # the last value is also returned (invisibly), so callers like mapply() can capture it
}
spatialmean(latval=42, lonval=-71,distance=5000)
How can I apply this logic to every lat-long combination in DF1 and insert the result back? I don't know how to pass rows as function arguments.
I think all you need to do is
df1$meanval <- mapply(FUN = spatialmean, latval = df1$lat,
lonval = df1$lon, distance = 5000)
However, I don't think this will be faster than a for loop. If speed is key, I would add the tag data.table to your question, as there is almost certainly a faster way to do this in data.table, but I'm not proficient enough with it to show you that solution.
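For what it is worth, here is one hedged data.table sketch of that route (a non-equi join over the same square box the question builds; the helper column names are illustrative):
library(data.table)
setDT(df1); setDT(df2)
# half-widths of the box in degrees: ~110574 m per degree of latitude;
# a degree of longitude shrinks with cos(latitude)
dist_m <- 5000
df1[, `:=`(lat_lo = lat - dist_m / 110574,
           lat_hi = lat + dist_m / 110574,
           lon_lo = lon - dist_m / (111320 * cos(lat * pi / 180)),
           lon_hi = lon + dist_m / (111320 * cos(lat * pi / 180)))]
# for each df1 box, average v2 over the df2 points that fall inside it;
# by = .EACHI gives one result row per df1 row (NA when nothing matches)
means <- df2[df1, on = .(lat >= lat_lo, lat <= lat_hi,
                         lon >= lon_lo, lon <= lon_hi),
             .(meanval = mean(v2)), by = .EACHI]
df1[, meanval := means$meanval]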
If you can use an actual distance instead of a square box, I would use another function in the geosphere package to find the distances. The squares overlap, and their corners include points farther than 5000 m in your mean calculations.
# sample data with extra row to understand distances row/column
df1 <- data.frame(lat = c(41.57,41.58,42.57,41.55,41.55),
lon = c(-88.11,-88.12,-89.11,-88.31,-88.31),
v1 = c(11,12,55,12,12))
df2 <- data.frame(lat = c(41.57,41.58,42.57,41.55),
lon = c(-88.41,-88.12,-88.11,-88.61),
v2 = c(77,56,73,14))
# set max distance
maxdist <- 5000
# calculate all distances and check if within limit
distances <- distm(x = df1[ ,c('lon','lat')],y = df2[ ,c('lon','lat')])
withindistance <- distances < maxdist
# grab all v2 based on df1 and calculate the mean. returns NaN if no points within distance
df1$df2mean <- apply(withindistance,1,function(x,funv2){
mean(funv2[x])
},funv2 = df2$v2)
# or the apply like most would write it. either apply works
df1$df2mean <- apply(withindistance,1,function(x){
mean(df2$v2[x])
})
I have problems merging two dataframes of different lengths.
To make it as easy as possible, here are the datasets:
Dataset A - Persons
http://pastebin.com/HbaeqACi
Dataset B - Waterfeatures:
http://pastebin.com/UdDvNtHs
Dataset C - City:
http://pastebin.com/nATnkMRk
I have some R code which is not directly relevant to my problem, but I will paste it completely so you have exactly the same situation:
library(fossil)
library(plyr) # needed for ldply() below
#load data
persons = read.csv("person.csv", header = TRUE, stringsAsFactors=FALSE)
water = read.csv("water.csv", header =TRUE, stringsAsFactors=FALSE)
city = read.csv("city.csv", header =TRUE)
#### calculate distance
# Generate unique coordinates dataframe
UniqueCoordinates <- data.frame(unique(persons[,4:5]))
UniqueCoordinates$Id <- formatC((1:nrow(UniqueCoordinates)), width=3,flag=0)
# Generate a function that looks for the closest water feature for each id's coordinates and calculates/saves the distance
NearestW <- function(id){
tmp <- UniqueCoordinates[UniqueCoordinates$Id==id, 1:2]
WaterFeatures <- rbind(tmp,water[,2:3])
disnw <- earth.dist(WaterFeatures, dist=TRUE)[1:(nrow(WaterFeatures)-1)]
disnw <- min(disnw)
disnw <- data.frame(disnw, WaterFeature=tmp)
return(disnw)
}
# apply distance calculation function to each id and the merge
CoordinatesWaterFeature <- ldply(UniqueCoordinates$Id, NearestW)
persons <- merge(persons, CoordinatesWaterFeature, by.x=c(4,5), by.y=c(2,3))
Now I want to copy the calculated distance over to the city dataset. I've tried to use merge (both datasets have the city attribute); persons only contains cities from the city dataset.
city_all_parameters = city
city_all_parameters = merge(city_all_parameters, persons[,c("city", "disnw")], all=TRUE)
Unfortunately this is not the outcome I want. I get 164 rows, but I only want these 5 rows plus the variable disnw and its corresponding value.
I've tried it out with rbind as well, but there I get the error:
"Error in rbind(deparse.level, ...) : numbers of columns of arguments do not match"
Any tip, how to solve this problem?
Your code works as you intended, but I wanted to show you a more elegant way to do it in base R. I have commented the code:
library(fossil)
# If you want to use pastebin, you can make it easy to load in for us like this:
# But I recommend using dput(persons) and pasting the results in.
persons = read.csv("http://pastebin.com/raw.php?i=HbaeqACi", header = TRUE, stringsAsFactors=FALSE)
water = read.csv("http://pastebin.com/raw.php?i=UdDvNtHs", header =TRUE, stringsAsFactors=FALSE)
city = read.csv("http://pastebin.com/raw.php?i=nATnkMRk", header =TRUE)
# Use column names instead of column indices to clarify your code
UniqueCoordinates <- data.frame(unique(persons[,c('POINT_X','POINT_Y')]))
# I didn't understand why you wanted to format the Id,
# but you don't need the Id in this code
# UniqueCoordinates$Id <- formatC((1:nrow(UniqueCoordinates)), width=3,flag=0)
# Instead of calculating the pairwise distances between all
# the water points every time, use deg.dist with mapply:
UniqueCoordinates$disnw <- mapply(function(x,y) min(deg.dist(long1=x,lat1=y,
long2=water$POINT_X,
lat2=water$POINT_Y)),
UniqueCoordinates$POINT_X,
UniqueCoordinates$POINT_Y)
persons <- merge(UniqueCoordinates,persons)
# I think this is what you wanted...
unique(persons[c('city','disnw')])
# city disnw
# 1 City E 6.4865635
# 20 City A 1.6604204
# 69 City B 0.9893909
# 113 City D 0.6001968
# 148 City C 0.5308953
# If you want to merge to the city
merge(persons,city,by='city')
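And to get exactly the 5-row shape the question asked for (the city table plus disnw), one more line works, relying on each city having a single disnw as in the output above:
merge(city, unique(persons[c('city', 'disnw')]), by = 'city')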