I have a problem with spatial data.
I need to extract temperature data from a NetCDF file; then I need to associate the temperature at each latitude/longitude with the closest points in another set of latitudes and longitudes contained in a different data frame.
This is the code I used to extract my variables:
library(ncdf4)
myfile <- nc_open(paste(wd, 'myfile.nc', sep=''))
timearr = ncvar_get(myfile, "time")
temp <- ncvar_get(myfile, 'temp_srf')
lat <- ncvar_get(myfile, 'lat_rho')
lon <- ncvar_get(myfile, 'lon_rho')
dim(temp)
[1] 27 75 52 # which means: 27 longitudes * 75 latitudes * 52 time steps
I chose to work on the first time step of temperature for now. So:
> t1 <- as.vector(temp[,,1])
Then I created a data.frame including lat, lon and temperature in the first time step:
lat1 <- as.vector(lat)
lon1 <- as.vector(lon)
df1 <- data.frame(lon1, lat1, t1)
head(df1)
lon1 lat1 t1
1 18.15338 40.48656 13.96225
2 18.24083 40.55126 14.36726
3 18.32845 40.61589 14.53822
4 18.41627 40.68045 14.78643
5 18.50427 40.74495 14.88624
6 18.59246 40.80938 14.95925
In another data frame (df2) I have some random points of latitude and longitude that I have to associate with the closest latitude and longitude of the previous data frame:
> df2 <- read.csv(paste(id, "myfile.csv", sep=""), header=TRUE, sep=",")
> head(df2)
LONs LATs
1 14.13189 43.41072
2 14.13342 43.34871
3 14.09980 43.40822
4 14.05338 43.72771
5 13.91311 43.88051
6 13.98500 43.91164
I was thinking of computing the distance between each pair of points and taking the smallest one, but I don't know how to do it, and I'm not sure whether there are better solutions.
I am assuming your data are unprojected lat/long coordinates, and that you need to calculate great circle distances. You can use a formula yourself (see my answer here), or you can use rdist.earth from the package fields. For each entry in df2, calculate the distance from all entries in df1, find the index of the minimum distance in that vector, and use that index to select the appropriate row of df1 to assign temp to df2. It only takes one line (but it might be clearer to separate the steps over a few commands):
require( fields )
df2["Temp"] <- df1[ sapply( seq_len( nrow(df2) ) , function(x){ which.min( rdist.earth( df2[x,] , as.matrix( df1[ c("lon1" , "lat1") ] ) , miles = FALSE, R = 6371 ) ) } ) , "t1" ]
And the results using your data:
df1
# lon1 lat1 t1
# 1 18.15338 40.48656 13.96225
# 2 18.24083 40.55126 14.36726
# 3 18.32845 40.61589 14.53822
# 4 18.41627 40.68045 14.78643
# 5 18.50427 40.74495 14.88624
# 6 18.59246 40.80938 14.95925
df2
# LONs LATs Temp
# 1 14.13189 43.41072 13.96225
# 2 14.13342 43.34871 13.96225
# 3 14.09980 43.40822 13.96225
# 4 14.05338 43.72771 14.53822
# 5 13.91311 43.88051 14.53822
# 6 13.98500 43.91164 14.78643
It looks like your points are at least a km apart (>300 km in this data), so you should get good accuracy with the great circle formula. If they are closer than 1 km you may want to use the Haversine formula.
Two formulas I like for getting the distance between two lat/long coordinates are the Haversine formula and Vincenty's formula. The Haversine formula is the simpler of the two and assumes Earth is a perfect sphere; you will probably get accuracy to within a few feet. If you need a higher level of accuracy, try Vincenty's formula: it is spheroid-based, which attempts to account for Earth's imperfect sphere shape. The samples at the links aren't in R, but it shouldn't be difficult to rewrite them in R.
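For instance, a minimal R transcription of the Haversine formula (my own sketch, assuming a spherical Earth of radius 6371 km and coordinates in degrees):
haversine <- function(lat1, lon1, lat2, lon2, R = 6371) {
  toRad <- pi / 180
  dlat  <- (lat2 - lat1) * toRad
  dlon  <- (lon2 - lon1) * toRad
  a <- sin(dlat / 2)^2 + cos(lat1 * toRad) * cos(lat2 * toRad) * sin(dlon / 2)^2
  2 * R * asin(sqrt(pmin(1, a)))  # great circle distance in km
}
haversine(40.48656, 18.15338, 40.55126, 18.24083)  # ~10.3 km (first two df1 points above)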
Related
I have two large dataframes called intersections (representing intersections of a street system) and users (representing users of a network) as follows:
intersections has three columns: x,y and label_street. They respectively represent the position of an intersection in a squared observation window (say [0,5] x [0,5]) and the street it is located on. Here is an example:
intersections <- data.frame(x=c(0.147674, 0.235356, 0.095337, 0.147674), y=c(0.132956, 0.150813, 0.087345, 0.132956), label_street = c(5,6,5,6))
head(intersections)
x y label_street
1 0.147674 0.132956 5
2 0.235356 0.150813 6
3 0.095337 0.087345 5
4 0.147674 0.132956 6
An intersection being located at the crossing of several streets, every (x,y) combination in the intersections table appears at least twice, but with different label_street values (e.g. rows 1 and 4 in the previous example). The label_street may not be the row number (which is why it starts at 5 in my example).
users has 4 columns: x, y, label_street and ID. They respectively represent the position of a user, the street the user is located on, and a unique ID per user. There are no duplicates in this data frame, as a user is located on a unique street and has a unique ID. Here is an example (the ID and the label_street may not be the row number):
users <- data.frame(x = c(0.20428152, 0.17840619, 0.12964668, 0.20423856, 0.19349761, 0.10861251), y = c(0.14448448, 0.13921481, 0.11724543, 0.14447573, 0.14228827, 0.09891443), label_street = c(6,6,5,6,6,5), ID = c(2703, 3460, 4325, 12506, 19753, 21282))
head(users)
x y label_street ID
1 0.20428152 0.14448448 6 2703
2 0.17840619 0.13921481 6 3460
3 0.12964668 0.11724543 5 4325
4 0.20423856 0.14447573 6 12506
5 0.19349761 0.14228827 6 19753
6 0.10861251 0.09891443 5 21282
What I want to do is the following: for each point (x,y) of intersections, get the ID of, and the distance to, its closest neighbour sharing the same label_street in users.
I have a working solution using the spatstat function nncross for nearest neighbour searching and the plyr function adply for working on the data.
My working solution is as follows:
1) Write a user-defined function which gets the ID and the distance to the nearest neighbour of a row in a query table
NN <- function(row,query){
df <- row
window <- c(0,5,0,5) #Need this to convert to ppp objects and compute NN distance using nncross
NN <- nncross(as.ppp(row[,1:2],window),as.ppp(query[,1:2],window))
df$NN.ID <- query$ID[NN$which]
df$NN.dist <- NN$dist
return(df)
}
2) Apply this user-defined function row-wise to my data frame "intersections", with the query being the subset of users sharing the same label_street as the row:
result <- adply(intersections, 1, function(row) NN(row, users[users$label_street == row$label_street, ]))
On the example, the result is as follows:
head(result)
x y label_street NN.ID NN.dist
1 0.147674 0.132956 5 4325 0.02391247
2 0.235356 0.150813 6 2703 0.03171236
3 0.095337 0.087345 5 21282 0.01760940
4 0.147674 0.132956 6 3460 0.03136304
Since my real data frames will be huge, I think computing full distance matrices to find the nearest neighbours won't be efficient, and that adply will be slow.
Does anyone have an idea for a data.table-like solution? I only know the basics of data.table and have always found it very efficient compared to plyr.
This solution uses the RANN package to find nearest neighbours. The trick is to first ensure that elements with different label_street values lie farther apart than any two elements sharing the same label_street. We do this by adding an additional numeric column with a very large value that is constant within the same label_street but different between different values of label_street. In total, you get:
intersections <- data.frame(x=c(0.147674, 0.235356, 0.095337, 0.147674), y=c(0.132956, 0.150813, 0.087345, 0.132956), label_street = c(5,6,5,6))
users <- data.frame(x = c(0.20428152, 0.17840619, 0.12964668, 0.20423856, 0.19349761, 0.10861251), y = c(0.14448448, 0.13921481, 0.11724543, 0.14447573, 0.14228827, 0.09891443), label_street = c(6,6,5,6,6,5), number = c(2703, 3460, 4325, 12506, 19753, 21282))
# add a numeric column that is constant within each category and has a very large value
intersections$label_street_large <- intersections$label_street * 1e6
users$label_street_large <- users$label_street * 1e6
# call the nearest neighbour function (k = 1 neighbour)
nearest_neighbours <- RANN::nn2(
  users[, c("x", "y", "label_street_large")],
  intersections[, c("x", "y", "label_street_large")],
  k = 1
)
# get original IDs and distances (nn.idx indexes into users, one row per intersection)
IDs <- users$number[c(nearest_neighbours$nn.idx)]
distances <- c(nearest_neighbours$nn.dists)
IDs
# [1]  4325  2703 21282  3460
distances
# [1] 0.02391247 0.03171236 0.01760940 0.03136304
I hope this helps you. It should be very fast because it only calls nn2 once, and nn2 runs in O(N * log(N)) time.
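For comparison, an equivalent per-group formulation (my sketch, not part of the answer above): split the intersections by label_street and call nn2 once per street. This avoids the magic constant at the cost of one nn2 call per group, and assumes every street present in intersections also has at least one user:
res <- do.call(rbind, lapply(split(seq_len(nrow(intersections)),
                                   intersections$label_street), function(idx) {
  # users on the same street as this group of intersections
  u  <- users[users$label_street == intersections$label_street[idx[1]], ]
  nn <- RANN::nn2(u[, c("x", "y")], intersections[idx, c("x", "y")], k = 1)
  data.frame(row = idx, NN.ID = u$number[c(nn$nn.idx)], NN.dist = c(nn$nn.dists))
}))
res[order(res$row), ]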
I have a list of co-ordinates of certain bus stops in this format
Bus_Stop_ID lat long
A -34.04199 18.61747
B -33.92312 18.44649
I then have a list of certain shops
Shop_ID lat long
1 -34.039350 18.617964
2 -33.927820 18.410520
I would like to check whether the shops fall within a 500 metre radius from the bus stop. Ultimately, the final dataset would look something like this where the Bus_Stop column indicates T/F and Bus_Stop_ID shows the relevant BUS ID(s) for that shop if Bus_Stop == T -
Shop_ID lat long Bus_Stop Bus_ID
1 -34.039350 18.617964 TRUE A
2 -33.927820 18.410520 FALSE <NA>
Does anyone have an idea about how I can go about this using R? I've seen the package geosphere but have struggled to understand it given my relative inexperience in the spatial domain. Any ideas or packages you could recommend? Thank you
Updated with a more scalable solution:
The previous answer (still included below) is not suited for large data sets. The reason is that we need to compute the distance for each pair of shops and bus stops, so both the memory use and the computation scale as O(N*M) for N shops and M bus stops. A more scalable solution uses a data structure such as a KD-Tree to perform nearest neighbour search for each shop. The advantage is that the computational complexity becomes O(M*log(M)) for building the KD-Tree of the bus stops and O(N*log(M)) for searching the nearest neighbour of each shop.
To do this, we can use nn2 from the RANN package. The complication is that nn2 deals only with Euclidean distances and does not know anything about lat/long. Therefore, we need to convert the lat/long coordinates to some map projection (e.g. UTM) in order to use it correctly (i.e., in order to compute the Euclidean distance between shops and bus stops correctly).
Note: The following borrows heavily from Josh O'Brien's solutions for determining the UTM zone from a longitude and for converting lat/long to UTM, so he should take a bow.
## First define a function from Josh OBrien's answer to convert
## a longitude to its UTM zone
long2UTM <- function(long) {
(floor((long + 180)/6) %% 60) + 1
}
## Assuming that all points are within a zone (within 6 degrees in longitude),
## we use the first shop's longitude to get the zone.
z <- long2UTM(shops[1,"long"])
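## (sanity check: for the sample data, long2UTM(18.617964) gives zone 34)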
library(sp)
library(rgdal)
## convert the bus lat/long coordinates to UTM for the computed zone
## using the other Josh O'Brien linked answer
bus2 <- bus
coordinates(bus2) <- c("long", "lat")
proj4string(bus2) <- CRS("+proj=longlat +datum=WGS84")
bus.xy <- spTransform(bus2, CRS(paste0("+proj=utm +zone=", z, " +ellps=WGS84")))
## convert the shops lat/long coordinates to UTM for the computed zone
shops2 <- shops
coordinates(shops2) <- c("long", "lat")
proj4string(shops2) <- CRS("+proj=longlat +datum=WGS84")
shops.xy <- spTransform(shops2, CRS(paste0("+proj=utm +zone=", z, " +ellps=WGS84")))
library(RANN)
## find the nearest neighbour in bus.xy@coords for each shops.xy@coords
res <- nn2(bus.xy@coords, shops.xy@coords, 1)
## res$nn.dists is a vector of the distance to the nearest bus.xy@coords for each shops.xy@coords
## res$nn.idx is a vector of indices into bus.xy of the nearest bus.xy@coords for each shops.xy@coords
shops$Bus_Stop <- res$nn.dists <= 500
## as.character() avoids ifelse() returning the factor's integer codes
shops$Bus_ID <- ifelse(res$nn.dists <= 500, as.character(bus$Bus_Stop_ID)[res$nn.idx], NA)
Although more complicated, this approach is much better suited for realistic problems where you may have large numbers of shops and bus stops. Using the same supplied data:
print(shops)
## Shop_ID lat long Bus_Stop Bus_ID
##1 1 -34.03935 18.61796 TRUE A
##2 2 -33.92782 18.41052 FALSE <NA>
You can do this using the package geosphere. Here, I'm assuming that your first data frame is named bus, and your second data frame is named shops:
library(geosphere)
g <- expand.grid(1:nrow(shops), 1:nrow(bus))
d <- matrix(distGeo(shops[g[,1],c("long","lat")], bus[g[,2],c("long","lat")]),
nrow=nrow(shops))
shops$Bus_Stop <- apply(d, 1, function(x) any(x <= 500))
shops$Bus_ID <- bus[apply(d, 1, function(x) {
  idx <- which(x <= 500)
  if (length(idx) == 0) NA else idx[1]
}), "Bus_Stop_ID"]
print(shops)
## Shop_ID lat long Bus_Stop Bus_ID
##1 1 -34.03935 18.61796 TRUE A
##2 2 -33.92782 18.41052 FALSE <NA>
Notes:
We first use expand.grid to enumerate all pair combinations of shops and bus stops. These are ordered by shops first.
We then compute the distance matrix d using geosphere::distGeo. Note that the input expects (lon, lat) coordinates, and distGeo returns distances in meters. The resulting d matrix is nrow(shops) by nrow(bus), so each row gives the distance from a shop to each bus stop.
We then see if there is a bus stop within 500 meters of each shop by applying the function any(x <= 500) for each row x in d using apply with MARGIN=1.
Similarly, we can extract the column of d (corresponding to the row in bus) for the first bus stop that is within 500 meters of each shop, using which instead of any in our applied function. We then use this result to select the Bus_Stop_ID from bus.
By the way, we don't have to apply the condition x <= 500 twice. The following will also work:
shops$Bus_ID <- bus[apply(d, 1, function(x) {
  idx <- which(x <= 500)
  if (length(idx) == 0) NA else idx[1]
}), "Bus_Stop_ID"]
shops$Bus_Stop <- !is.na(shops$Bus_ID)
and is more efficient.
Data:
bus <- structure(list(Bus_Stop_ID = structure(1:2, .Label = c("A", "B"
), class = "factor"), lat = c(-34.04199, -33.92312), long = c(18.61747,
18.44649)), .Names = c("Bus_Stop_ID", "lat", "long"), class = "data.frame", row.names = c(NA,
-2L))
shops <- structure(list(Shop_ID = 1:2, lat = c(-34.03935, -33.92782),
long = c(18.617964, 18.41052), Bus_ID = structure(c(1L, NA
), .Label = c("A", "B"), class = "factor"), Bus_Stop = c(TRUE,
FALSE)), .Names = c("Shop_ID", "lat", "long", "Bus_ID", "Bus_Stop"
), row.names = c(NA, -2L), class = "data.frame")
My first approach would be to just use Euclidean distance and check whether the resulting value falls within the 500 metre radius.
You could then use an IF clause to fill in the T/F column.
I hope this helps.
PS: Over a distance of 500 m the Earth's surface is essentially flat, so I don't think you need any geoid packages.
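To make that concrete, a minimal flat-earth sketch (my own code, not the answerer's; it assumes the bus and shops data frames from the question and that distances are small enough for a planar approximation):
deg_lat_m <- 111320                         # approx. metres per degree of latitude
near <- sapply(seq_len(nrow(shops)), function(i) {
  deg_lon_m <- deg_lat_m * cos(shops$lat[i] * pi / 180)  # metres per degree of longitude here
  d <- sqrt(((shops$lat[i]  - bus$lat)  * deg_lat_m)^2 +
            ((shops$long[i] - bus$long) * deg_lon_m)^2)
  any(d <= 500)
})
near
# [1]  TRUE FALSE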
I have 2 dataframes with thousands of data points:
lat lon v1
41.57 -88.11 11
41.58 -88.12 12
42.57 -89.11 55
41.55 -88.31 12
lat lon v2
41.57 -88.41 77
41.58 -88.12 56
42.57 -89.11 73
41.55 -88.61 14
And I want to:
Read each Lat-Long combination in Dataframe 1
Create a square bounding box
Search all the Lat-Long combinations in that bounding box in Dataframe 2
Find a mean of v2 in that bounding box and add it to corresponding row in Dataframe 1
I have gotten this far, writing it for a single (latval, lonval):
library(geosphere)
spatialmean <- function(latval, lonval, distance) {
  coords <- c(lonval, latval)
  ne.coords <- c(destPoint(p = coords, b = 90, d = distance)[1],
                 destPoint(p = coords, b = 0, d = distance)[2])
  sw.coords <- c(destPoint(p = coords, b = 90, d = -distance)[1],
                 destPoint(p = coords, b = 0, d = -distance)[2])
  lon1 <<- sw.coords[1] # Lon of the left side of the box
  lat1 <<- sw.coords[2] # Lat of the bottom side of the box
  lon2 <<- ne.coords[1] # Lon of the right side of the box
  lat2 <<- ne.coords[2] # Lat of the top side of the box
  df2temp <<- subset(df2, lon >= lon1 & lon <= lon2 & lat >= lat1 & lat <= lat2)
  meantemp <<- mean(df2temp$v2)
}
spatialmean(latval=42, lonval=-71,distance=5000)
How can I apply this logic to each lat-long combination in df1 and insert the results back into it? I don't know how to pass rows as function arguments.
I think all you need to do is
df1$meanval <- mapply(FUN = spatialmean, latval = df1$lat,
lonval = df1$lon, distance = 5000)
However, I don't think this will be faster than a for loop. If speed is key, I would add the tag data.table to your question, as there is almost certainly a faster way to do this in data.table, but I'm not proficient enough with it to show you that solution.
If you can use actual distances instead of a bounding box, I would use another function in the geosphere package to find distances directly. Your squares will have overlapping areas, and the corners of each box will include points at distances greater than 5000 m in your mean calculations.
# sample data with extra row to understand distances row/column
df1 <- data.frame(lat = c(41.57,41.58,42.57,41.55,41.55),
lon = c(-88.11,-88.12,-89.11,-88.31,-88.31),
v1 = c(11,12,55,12,12))
df2 <- data.frame(lat = c(41.57,41.58,42.57,41.55),
lon = c(-88.41,-88.12,-88.11,-88.61),
v2 = c(77,56,73,14))
# set max distance
maxdist <- 5000
# calculate all pairwise distances (in metres) and check if within limit
library(geosphere)
distances <- distm(x = df1[, c('lon','lat')], y = df2[, c('lon','lat')])
withindistance <- distances < maxdist
# grab all v2 values within distance of each df1 point and calculate the mean; returns NaN if no points are within distance
df1$df2mean <- apply(withindistance,1,function(x,funv2){
mean(funv2[x])
},funv2 = df2$v2)
# or the apply as most would write it; either version works
df1$df2mean <- apply(withindistance,1,function(x){
mean(df2$v2[x])
})
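One small guard worth adding (my suggestion, not in the original answer): rows of df1 with no df2 points within maxdist get NaN from mean(), which you may prefer as an explicit NA:
df1$df2mean[is.nan(df1$df2mean)] <- NA  # no df2 points within maxdist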
I have 2 sets of points, set1 and set2. Both sets of points have a date associated with each point. Points in set1 are "ephemeral", and only exist on the given date. Points in set2 are "permanent": they are constructed at a given date and then exist forever after that date.
set.seed(1)
dates <- seq(as.Date('2011-01-01'),as.Date('2011-12-31'),by='days')
set1 <- data.frame(lat=40+runif(10000),
lon=-70+runif(10000),date=sample(dates,10000,replace=TRUE))
set2 <- data.frame(lat=40+runif(100),
lon=-70+runif(100),date=sample(dates,100,replace=TRUE))
Here's my problem: for each point in set1 (ephemeral), find the distance to the closest point in set2 (permanent) that was constructed BEFORE the event in set1 occurred. For example, the 1st point in set1 occurred on 2011-03-18:
> set1[1,]
lat lon date
1 40.26551 -69.93529 2011-03-18
So I want to find the closest point in set2 that was constructed before 2011-03-18:
> head(set2[set2$date<=as.Date('2011-03-18'),])
lat lon date
1 40.41531 -69.25765 2011-02-18
7 40.24690 -69.29812 2011-02-19
13 40.10250 -69.52515 2011-02-12
14 40.53675 -69.28134 2011-02-27
17 40.66236 -69.07396 2011-02-17
20 40.67351 -69.88217 2011-01-04
The additional wrinkle is that these are latitude/longitude points, so I have to calculate distances along the surface of the earth. The R package fields provides a convenient function to do this:
require(fields)
distMatrix <- rdist.earth(set1[,c('lon','lat')],
set2[,c('lon','lat')], miles = TRUE)
My question is: how can I adjust the distances in this matrix to Inf if the point in set2 (column of the distance matrix) was constructed after the point in set1 (row of the distance matrix)?
Here is what I would do:
earlierMatrix <- outer(set1$date, set2$date, "<=")
distMatrix2 <- distMatrix + ifelse(earlierMatrix, Inf, 0)
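A possible next step (my sketch, not part of the answer above): pull the nearest valid distance and its set2 index from the adjusted matrix.
set1$dist <- apply(distMatrix2, 1, min)            # Inf when no earlier set2 point exists
set1$objectID <- apply(distMatrix2, 1, which.min)  # caution: which.min returns 1 even on an all-Inf row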
Here's my attempt at an answer. It's not particularly efficient, but I think it is correct. It also allows you to easily sub in different distance calculators:
#Calculate distances
require(fields)
distMatrix <- lapply(1:nrow(set1),function(x) {
#Find distances to all points
distances <- rdist.earth(set1[x,c('lon','lat')], set2[,c('lon','lat')], miles = TRUE)
#Set distance to Inf if the set1 point occurred BEFORE the set2 dates
distances <- ifelse(set1[x,'date']<set2[,'date'], Inf, distances)
return(distances)
})
distMatrix <- do.call(rbind,distMatrix)
#Find distance to closest object
set1$dist <- apply(distMatrix,1,min)
#Find id of closest object
objectID <- lapply(1:nrow(set1),function(x) {
  if (set1[x,'dist']<Inf) {
    IDs <- which(set1[x,'dist']==distMatrix[x,])
  } else {
    IDs <- NA
  }
  #Randomly break ties (if there are any); note sample(IDs, 1) would sample
  #from 1:IDs when IDs has length 1, so index by position instead
  return(IDs[sample.int(length(IDs), 1)])
})
set1$objectID <- do.call(rbind,objectID)
Here's the head of the resulting dataset:
> head(set1)
lat lon date dist objectID
1 40.26551 -69.93529 2011-03-18 3.215514 13
2 40.37212 -69.32339 2011-02-11 10.320910 46
3 40.57285 -69.26463 2011-02-23 3.954132 4
4 40.90821 -69.88870 2011-04-24 4.132536 49
5 40.20168 -69.95335 2011-02-24 4.284692 45
6 40.89839 -69.86909 2011-07-12 3.385769 57
I'm trying to read a GRIB file wavedata.grib with wave heights from the ECMWF ERA-40 website, using R. Here is my source code so far:
mylat = 43.75
mylong = 331.25
# read the GRIB file
library(rgdal)
library(sp)
gribfile<-"wavedata.grib"
grib <- readGDAL(gribfile)
summary = GDALinfo(gribfile,silent=TRUE)
save(summary, file="summary.txt",ascii = TRUE)
# >names(summary): rows columns bands ll.x ll.y res.x res.y oblique.x oblique.y
rows = summary[["rows"]]
columns = summary[["columns"]]
bands = summary[["bands"]]
# z=geometry(grib)
# Grid topology:
# cellcentre.offset cellsize cells.dim
# x 326.25 2.5 13
# y 28.75 2.5 7
# SpatialPoints:
# x y
# [1,] 326.25 43.75
# [2,] 328.75 43.75
# [3,] 331.25 43.75
myframe<-t(data.frame(grib))
# myframe[bands+1,3]=331.25 myframe[bands+2,3]=43.75
# myframe[1,3]=2.162918 myframe[2,3]=2.427078 myframe[3,3]=2.211989
# These values should match the values read by Degrib (see below)
# degrib.exe wavedata.grib -P -pnt 43.75,331.25 -Interp 1 > wavedata.txt
# element, unit, refTime, validTime, (43.750000,331.250000)
# SWH, [m], 195709010000, 195709010000, 2.147
# SWH, [m], 195709020000, 195709020000, 2.159
# SWH, [m], 195709030000, 195709030000, 1.931
lines = rows * columns
mycol = 0
for (i in 1:lines) {
if (mylat==myframe[bands+2,i] & mylong==myframe[bands+1,i]) {mycol = i+1}
}
# notice mycol = i+1 in order to get values in column to the right
myvector <- as.numeric(myframe[,mycol])
sink("output.txt")
cat("lat:",myframe[bands+2,mycol],"long:",myframe[bands+1,mycol],"\n")
for (i in 1:bands) { cat(myvector[i],"\n") }
sink()
The wavedata.grib file has gridded SWH values for the period 1957-09-01 to 2002-08-31. Each band corresponds to 00h of one day, so each lat/long grid point has a series of 16436 SWH values (one per band).
myframe has dimensions 16438 x 91. Notice 91 = 7 rows x 13 columns, and 16438 = 16436 bands + 2: the two additional rows hold the long and lat values, while every other row should be the wave heights of the corresponding band.
The problem is that I want to extract SWH (wave heights) at lat/long = 43.75, 331.25, but the values don't match those I get reading the file at the same lat/long with the Degrib utility.
Also, the correct values I want (2.147, 2.159, 1.931, ...) are in column 4, not column 3 of myframe, even though myframe[16438,3] = 43.75 (lat) and myframe[16437,3] = 331.25 (long). Why is this? I would like to know which lat/long the myframe[i,j] values actually correspond to, or whether there is some data import error in the process. I'm assuming Degrib has no errors.
Is there any R routine to easily interpolate values in a matrix if I want to extract values between grid points? More generally, I need help writing an effective R function to extract wave heights like this:
SWH <- function (latitude, longitude, date/time)
Please help.