I have 2 sets of points, set1 and set2. Each point in both sets has a date associated with it. Points in set1 are "ephemeral" and exist only on the given date. Points in set2 are "permanent": they are constructed on a given date and then exist forever after that date.
set.seed(1)
dates <- seq(as.Date('2011-01-01'),as.Date('2011-12-31'),by='days')
set1 <- data.frame(lat=40+runif(10000),
lon=-70+runif(10000),date=sample(dates,10000,replace=TRUE))
set2 <- data.frame(lat=40+runif(100),
lon=-70+runif(100),date=sample(dates,100,replace=TRUE))
Here's my problem: for each point in set1 (ephemeral), find the distance to the closest point in set2 (permanent) that was constructed BEFORE the event in set1 occurred. For example, the 1st point in set1 occurred on 2011-03-18:
> set1[1,]
lat lon date
1 40.26551 -69.93529 2011-03-18
So I want to find the closest point in set2 that was constructed before 2011-03-18:
> head(set2[set2$date<=as.Date('2011-03-18'),])
lat lon date
1 40.41531 -69.25765 2011-02-18
7 40.24690 -69.29812 2011-02-19
13 40.10250 -69.52515 2011-02-12
14 40.53675 -69.28134 2011-02-27
17 40.66236 -69.07396 2011-02-17
20 40.67351 -69.88217 2011-01-04
The additional wrinkle is that these are latitude/longitude points, so I have to calculate distances along the surface of the earth. The R package fields provides a convenient function to do this:
require(fields)
distMatrix <- rdist.earth(set1[,c('lon','lat')],
set2[,c('lon','lat')], miles = TRUE)
My question is, how can I adjust the distances in this matrix to Inf if the point in set2 (column of distance matrix) was constructed after the point in set1 (row of distances matrix)?
Here is what I would do:
earlierMatrix <- outer(set1$date, set2$date, "<=")
distMatrix2 <- distMatrix + ifelse(earlierMatrix, Inf, 0)
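As a minimal sketch of the masking idea on toy data (the dates and distances below are made up for illustration, and a strict "<" is assumed so that a site built on the same day as the event still counts as available):

```r
# Hypothetical toy data: 3 ephemeral events and 2 permanent sites
event_dates <- as.Date(c("2011-01-10", "2011-03-01", "2011-06-15"))
site_dates  <- as.Date(c("2011-02-01", "2011-05-01"))

# Pretend distance matrix in miles (rows = events, columns = sites)
d <- matrix(c(5, 2,
              4, 1,
              3, 6), nrow = 3, byrow = TRUE)

# TRUE where the site had not yet been built when the event occurred
not_built <- outer(as.numeric(event_dates), as.numeric(site_dates), "<")
d[not_built] <- Inf

# Distance to the nearest eligible site; Inf means no site existed yet
nearest_dist <- apply(d, 1, min)

# Column index of that site, or NA when no site was eligible
nearest_id <- apply(d, 1, which.min)
nearest_id[!is.finite(nearest_dist)] <- NA

nearest_dist
nearest_id
```

Here the first event predates both sites, so it gets Inf and NA; the other two events are matched to site 1.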
Here's my attempt at an answer. It's not particularly efficient, but I think it is correct. It also allows you to easily sub in different distance calculators:
#Calculate distances
require(fields)
distMatrix <- lapply(1:nrow(set1),function(x) {
#Find distances to all points
distances <- rdist.earth(set1[x,c('lon','lat')], set2[,c('lon','lat')], miles = TRUE)
#Set distance to Inf if the set1 point occurred BEFORE the set2 dates
distances <- ifelse(set1[x,'date']<set2[,'date'], Inf, distances)
return(distances)
})
distMatrix <- do.call(rbind,distMatrix)
#Find distance to closest object
set1$dist <- apply(distMatrix,1,min)
#Find id of closest object
objectID <- lapply(1:nrow(set1),function(x) {
if (set1[x,'dist']<Inf) {
IDs <- which(set1[x,'dist']==distMatrix[x,])
} else {
IDs <- NA
}
return(IDs[sample.int(length(IDs),1)]) #Randomly break ties (if there are any); indexing avoids sample(n,1) drawing from 1:n when IDs has length 1
})
set1$objectID <- do.call(rbind,objectID)
Here's the head of the resulting dataset:
> head(set1)
lat lon date dist objectID
1 40.26551 -69.93529 2011-03-18 3.215514 13
2 40.37212 -69.32339 2011-02-11 10.320910 46
3 40.57285 -69.26463 2011-02-23 3.954132 4
4 40.90821 -69.88870 2011-04-24 4.132536 49
5 40.20168 -69.95335 2011-02-24 4.284692 45
6 40.89839 -69.86909 2011-07-12 3.385769 57
There are five polygons for five different cities (see the attached file in the link; it's called bound.shp). I also have a point file, "points.csv", with longitude and latitude, where for each point I know the proportion of people belonging to group m and group h.
I am trying to calculate the spatial segregation proposed by Reardon and O’Sullivan, “Measures of Spatial Segregation”
There is a package called "seg" which should allow us to do it. I have tried, but so far without success.
Here is the link to the example file: LINK. After downloading the "example" folder, this is what I do:
setwd("~/example")
library(seg)
library(sf)
bound <- st_read("bound.shp")
points <- st_read("points.csv", options=c("X_POSSIBLE_NAMES=x","Y_POSSIBLE_NAMES=y"))
#I apply the following formula
seg::spseg(bound, points[ ,c(group_m, group_h)] , smoothing = "kernel", sigma = bandwidth)
Error: 'x' must be a numeric matrix with two columns
Can someone help me solve this issue? Or is there an alternate method which I can use?
Thanks a lot.
I don't know exactly what the spseg function does, but according to the seg package documentation:
The first argument, x, should be a data frame or an object of class Spatial.
The second argument, data, should be a matrix or data frame.
The Examples for spseg show that data should have as many rows as there are ids in the Spatial object. In your sample, the ids are the cities, each of which has its own polygon.
First, let's examine the bound data:
setwd("~/example")
library(seg)
library(sf)
#For the fortify function
library(ggplot2)
bound <- st_read("bound.shp")
bound <- as_Spatial(bound)
class(bound)
"SpatialPolygonsDataFrame"
attr(,"package")
"sp"
tail(fortify(bound))
Regions defined for each Polygons
long lat order hole piece id group
5379 83.99410 27.17326 972 FALSE 1 5 5.1
5380 83.99583 27.17339 973 FALSE 1 5 5.1
5381 83.99705 27.17430 974 FALSE 1 5 5.1
5382 83.99792 27.17552 975 FALSE 1 5 5.1
5383 83.99810 27.17690 976 FALSE 1 5 5.1
5384 83.99812 27.17700 977 FALSE 1 5 5.1
So you have 5 ids in your SpatialPolygonsDataFrame. Now, let's read points.csv with the read.csv function, since the data is required to be in matrix format for the spseg function.
points <- read.csv("c://Users/cemozen/Downloads/example/points.csv")
tail(points)
group_m group_h x y
950 4.95 78.49000 84.32887 26.81203
951 5.30 86.22167 84.27448 26.76932
952 8.68 77.85333 84.33353 26.80942
953 7.75 82.34000 84.35270 26.82850
954 7.75 82.34000 84.35270 26.82850
955 7.75 82.34000 84.35270 26.82850
The documentation and its example state that the number of rows in the data (which here has two attributes, group_m and group_h) must equal the number of ids (the cities). You may need to compute one summary value per polygon, for example the mean of each group for each city, so that there is only one row per polygon.
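A minimal sketch of that per-city summary in base R, assuming each point already carries a city column (for example from a spatial join); the data frame pts and its values are hypothetical:

```r
# Hypothetical points, each already labelled with the city it falls in
pts <- data.frame(
  city    = c(1, 1, 2, 2, 3),
  group_m = c(4, 6, 8, 10, 5),
  group_h = c(80, 90, 70, 90, 85)
)

# One row per city: the mean of each group within that city
city_means <- aggregate(cbind(group_m, group_h) ~ city, data = pts, FUN = mean)

# Numeric matrix with one row per city and one column per group
group_matrix <- as.matrix(city_means[, c("group_m", "group_h")])
group_matrix
```

Whether the mean is the right summary depends on what the proportions represent; a population-weighted mean may be more appropriate if point populations differ.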
That said, I would just like to show that the function works properly once it is fed a matrix that has 5 rows and 2 groups.
sample_spseg <- spseg(bound, as.matrix(points[1:5,c("group_m", "group_h")]))
print(sample_spseg)
Reardon and O'Sullivan's spatial segregation measures
Dissimilarity (D) : 0.0209283
Relative diversity (R): -0.008781
Information theory (H): -0.0066197
Exposure/Isolation (P):
group_m group_h
group_m 0.07577679 0.9242232
group_h 0.07516285 0.9248372
--
The exposure/isolation matrix should be read horizontally.
Read 'help(spseg)' for more details.
First: I do not have experience with the seg package and its functions.
What I read from your question is that you want to apply the spseg function to the points within each area?
If so, here is a possible approach:
library(sf)
library(tidyverse)
library(seg)
library(mapview) # for quick viewing only
# read polygons, make valid to avoid problems later on
areas <- st_read("./temp/example/bound.shp") %>%
sf::st_make_valid()
# read points and convert to sf object
points <- read.csv("./temp/example/points.csv") %>%
sf::st_as_sf(coords = c("x", "y"), crs = 4326) %>%
# spatial join of the city attribute (st_join uses st_intersects by default)
sf::st_join(areas)
# what do we have so far??
mapview::mapview(points, zcol = "city")
# get the coordinates back into a data.frame
mydata <- cbind(points, st_coordinates(points))
# drop the geometry, we do not need it anymore
st_geometry(mydata) <- NULL
# looks like...
head(mydata)
# group_m group_h city X Y
# 1 8.02 84.51 2 84.02780 27.31180
# 2 8.02 84.51 2 84.02780 27.31180
# 3 8.02 84.51 2 84.02780 27.31180
# 4 5.01 84.96 2 84.04308 27.27651
# 5 5.01 84.96 2 84.04622 27.27152
# 6 5.01 84.96 2 84.04622 27.27152
# Split to a list by city
L <- split(mydata, mydata$city)
# loop over list and perform spseg function
final <- lapply(L, function(i) spseg(x = i[, 4:5], data = i[, 1:2]))
# test for the first city
final[[1]]
# Reardon and O'Sullivan's spatial segregation measures
#
# Dissimilarity (D) : 0.0063
# Relative diversity (R): -0.0088
# Information theory (H): -0.0067
# Exposure/Isolation (P):
# group_m group_h
# group_m 0.1160976 0.8839024
# group_h 0.1157357 0.8842643
# --
# The exposure/isolation matrix should be read horizontally.
# Read 'help(spseg)' for more details.
spplot(final[[1]], main = "Equal")
I have two large dataframes called intersections (representing intersections of a street system) and users (representing users of a network) as follows:
intersections has three columns: x, y and label_street. They represent, respectively, the position of an intersection in a square observation window (say [0,5] x [0,5]) and the street it is located on. Here is an example:
intersections <- data.frame(x=c(0.147674, 0.235356, 0.095337, 0.147674), y=c(0.132956, 0.150813, 0.087345, 0.132956), label_street = c(5,6,5,6))
head(intersections)
x y label_street
1 0.147674 0.132956 5
2 0.235356 0.150813 6
3 0.095337 0.087345 5
4 0.147674 0.132956 6
An intersection being located at the crossing of several streets, every (x,y) combination in the intersections table appears at least twice, but with different label_street (e.g. rows 1 and 4 in the previous example). The label_street may not be the row number (which is why it starts at 5 in my example).
users has 4 columns: x,y, label_street, ID. They respectively represent the position of a user, the street it is located on and a unique ID per user. There are no duplicates in this dataframe, as a user is located on a unique street and has a unique ID. Here is an example (the ID and the label_street may not be the row number)
users <- data.frame(x = c(0.20428152, 0.17840619, 0.12964668, 0.20423856, 0.19349761, 0.10861251), y = c(0.14448448, 0.13921481, 0.11724543, 0.14447573, 0.14228827, 0.09891443), label_street = c(6,6,5,6,6,5), ID = c(2703, 3460, 4325, 12506, 19753, 21282))
head(users)
x y label_street ID
1 0.20428152 0.14448448 6 2703
2 0.17840619 0.13921481 6 3460
3 0.12964668 0.11724543 5 4325
4 0.20423856 0.14447573 6 12506
5 0.19349761 0.14228827 6 19753
6 0.10861251 0.09891443 5 21282
What I want to do is the following: for each point (x,y) of intersections, get the ID of, and the distance to, its closest neighbour in users sharing the same label_street.
I have a working solution using spatstat function nncross for nearest neighbour searching and plyr function adply for working on the data.
My working solution is as follows:
1) Write a user-defined function which gets the ID and the distance to the nearest neighbour of a row in a query table
library(spatstat) # provides as.ppp() and nncross()
NN <- function(row,query){
df <- row
window <- c(0,5,0,5) #Need this to convert to ppp objects and compute NN distance using nncross
nn <- nncross(as.ppp(row[,1:2],window),as.ppp(query[,1:2],window))
df$NN.ID <- query$ID[nn$which]
df$NN.dist <- nn$dist #Named NN.dist to match the result table
return(df)
}
2) Apply this user-defined function row-wise to my dataframe "intersections", with the query being the subset of users sharing the same label_street as the row:
result <- adply(intersections, 1, function(row) NN(row, users[users$label_street == row$label_street, ]))
The result is as follows on the example:
head(result)
x y label_street NN.ID NN.dist
1 0.147674 0.132956 5 4325 0.02391247
2 0.235356 0.150813 6 2703 0.03171236
3 0.095337 0.087345 5 21282 0.01760940
4 0.147674 0.132956 6 3460 0.03136304
Since my real dataframes will be huge, I think computing distance matrices for looking at the nearest neighbour won't be efficient and that adply will be slow.
Does anyone have an idea of a data.table-like solution? I only know the basics of data.table and have always found it very efficient compared to plyr.
This solution uses the RANN package to find nearest neighbours. The trick is to first ensure that elements with different label_street have a higher distance between them than elements within the same label_street. We do this by adding an additional numeric column with a very large value that is constant within the same label_street but different between different values of label_street. In total, you get:
intersections <- data.frame(x=c(0.147674, 0.235356, 0.095337, 0.147674), y=c(0.132956, 0.150813, 0.087345, 0.132956), label_street = c(5,6,5,6))
users <- data.frame(x = c(0.20428152, 0.17840619, 0.12964668, 0.20423856, 0.19349761, 0.10861251), y = c(0.14448448, 0.13921481, 0.11724543, 0.14447573, 0.14228827, 0.09891443), label_street = c(6,6,5,6,6,5), number = c(2703, 3460, 4325, 12506, 19753, 21282))
# add a numeric column that is constant within each category and has a very large value
intersections$label_street_large <- intersections$label_street * 1e6
users$label_street_large <- users$label_street * 1e6
# call the nearest neighbour function (k = 1 neighbour)
# nn2(data, query): users are the points searched, intersections are the query points
nearest_neighbours <- RANN::nn2(
data = users[, c("x", "y", "label_street_large")],
query = intersections[, c("x", "y", "label_street_large")],
k = 1
)
# get original IDs and distances (one per intersection)
IDs <- users$number[c(nearest_neighbours$nn.idx)]
distances <- c(nearest_neighbours$nn.dists)
IDs
# [1] 4325 2703 21282 3460
distances
# [1] 0.02391247 0.03171236 0.01760940 0.03136304
I hope this helps you. It should be very fast because it only calls nn2 once, which runs in O(N * log(N)) time.
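When trying a faster approach on a small sample, it may help to have a brute-force reference to compare against. The sketch below uses only base R and the example data from the question; the helper name nearest_same_street is made up for illustration, and the approach is only meant for small inputs:

```r
intersections <- data.frame(
  x = c(0.147674, 0.235356, 0.095337, 0.147674),
  y = c(0.132956, 0.150813, 0.087345, 0.132956),
  label_street = c(5, 6, 5, 6)
)
users <- data.frame(
  x = c(0.20428152, 0.17840619, 0.12964668, 0.20423856, 0.19349761, 0.10861251),
  y = c(0.14448448, 0.13921481, 0.11724543, 0.14447573, 0.14228827, 0.09891443),
  label_street = c(6, 6, 5, 6, 6, 5),
  ID = c(2703, 3460, 4325, 12506, 19753, 21282)
)

# Hypothetical helper: nearest user on the same street as intersection i
nearest_same_street <- function(i) {
  cand <- users[users$label_street == intersections$label_street[i], ]
  d <- sqrt((cand$x - intersections$x[i])^2 + (cand$y - intersections$y[i])^2)
  data.frame(NN.ID = cand$ID[which.min(d)], NN.dist = min(d))
}

# One row per intersection, in the original row order
check <- do.call(rbind, lapply(seq_len(nrow(intersections)), nearest_same_street))
cbind(intersections, check)
```

On this example it reproduces the IDs and distances of the working nncross solution in the question, so any alternative should return the same four rows.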
I am trying to calculate and save an output file that gives all the distances from long/lat coordinates associated to multiple samples in R.
Example of data:
Sample Latitude Longitude
A 70 141
B 72 142
C 71 143
D 69 141
I am currently using the geosphere package in R, and specifically the distVincentyEllipsoid function. You can use it like this:
distVincentyEllipsoid(p1 = c(141,70), p2 = c(142,72))
But this only gives you one distance between two samples at a time, and I need the distances between all pairs among 15 samples, written to an output file listing the sample pairs and their associated distances.
Example output:
Samples Distance(m)
A-B 8
A-C 26
B-C 13
A-D 20
Thanks.
Here is another solution with the outer function.
library(geosphere)
myList <- setNames(split(df[,c(3,2)], seq_len(nrow(df))), df$Sample)
distMat <- outer(myList, myList, Vectorize(distVincentyEllipsoid))
This gives a distance matrix whose distance is defined by distVincentyEllipsoid. The result is as follows:
> distMat
A B C D
A 0.0 226082.2 134163.1 111555.6
B 226082.2 0.0 117066.1 336761.1
C 134163.1 117066.1 0.0 235802.0
D 111555.6 336761.1 235802.0 0.0
Convert it to the format you want.
library(tidyr); library(dplyr)
distMat[lower.tri(distMat)] <- 0
distDf <- data.frame(distMat)
distDf$P1 <- row.names(distDf)
gather(distDf, P2, Distance, -P1) %>% filter(Distance != 0) %>%
mutate(Sample = paste(P1, P2, sep = "-")) %>% select(Sample, Distance)
Sample Distance
1 A-B 226082.2
2 A-C 134163.1
3 B-C 117066.1
4 A-D 111555.6
5 B-D 336761.1
6 C-D 235802.0
Note: I haven't had time to compare the efficiency, but since this solution avoids repeatedly subsetting rows from the original data frame, it should be relatively fast.
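The question also asks for an output file. As a sketch, a distance matrix like the one above can be reshaped with base R alone and written out with write.csv. The matrix values are copied from the printed distMat; the output path is a temporary file chosen for illustration:

```r
# Distance matrix in metres, values copied from the printed distMat
samples <- c("A", "B", "C", "D")
distMat <- matrix(
  c(0.0,      226082.2, 134163.1, 111555.6,
    226082.2, 0.0,      117066.1, 336761.1,
    134163.1, 117066.1, 0.0,      235802.0,
    111555.6, 336761.1, 235802.0, 0.0),
  nrow = 4, byrow = TRUE, dimnames = list(samples, samples)
)

# Row/column positions of the upper triangle, i.e. each unordered pair once
idx <- which(upper.tri(distMat), arr.ind = TRUE)

pairs <- data.frame(
  Samples = paste(rownames(distMat)[idx[, "row"]],
                  colnames(distMat)[idx[, "col"]], sep = "-"),
  Distance.m = distMat[idx]
)

# Write the pair table to a CSV file (path is an assumption; use your own)
out_file <- tempfile(fileext = ".csv")
write.csv(pairs, out_file, row.names = FALSE)
pairs
```

With 15 samples the same code yields 105 rows, one per unordered pair.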
You can do this in this way:
sample_names <- data$Sample
nrow_data <- nrow(data)
test <- function(x){
return (list(Sample = paste(sample_names[x[1]],sample_names[x[2]],sep='-'),
Distance.m = distVincentyEllipsoid(p1 = data[x[1],3:2], p2 = data[x[2],3:2])))
}
ans <- combn(1:nrow_data,2,test)
ans_df <- data.frame(Sample = unlist(ans[1,]),Distance.m = unlist(ans[2,]))
## Sample Distance.m
##1 A-B 226082.2
##2 A-C 134163.1
##3 A-D 111555.6
##4 B-C 117066.1
##5 B-D 336761.1
##6 C-D 235802.0
So what you want is each combination of two locations, together with their associated positions.
You can do this with joins and the data.table package:
library(data.table)
library(geosphere)
testdata <- data.table(Sample = LETTERS[1:4],
Latitude = c(70,72,71,69),
Longitude = c(141,142,143,141))
# Create each pair of combinations with combn
combTable <- rbindlist(combn(testdata$Sample,2,simplify = FALSE,FUN = as.list))
# Join on the first column
setkey(testdata,Sample)
setkey(combTable,V1)
combTable <- testdata[combTable]
#Join on the second column
setkey(combTable,V2)
combTable <- testdata[combTable]
# Mapply to fit the function's requirements of two vectors for each call
combTable[,.(dist = mapply(function(Lat1, Lon1, Lat2, Lon2)
distVincentyEllipsoid(c(Lon1, Lat1), c(Lon2, Lat2)),
Latitude,
Longitude,
i.Latitude,
i.Longitude,
SIMPLIFY =FALSE ),
Sample,
i.Sample)]
EDIT: doing this in one step without storing intermediate variables, as per @Arun's comment (and using magrittr syntax):
library(magrittr)
combTable <-
testdata[combTable, on = c('Sample' = 'V1')] %>%
testdata[., on = c(`Sample` = 'V2')] %>%
.[,.(dist = mapply(function(Lat1, Lon1, Lat2, Lon2)
distVincentyEllipsoid(c(Lon1, Lat1),c(Lon2, Lat2)),
Latitude,
Longitude,
i.Latitude,
i.Longitude,
SIMPLIFY = FALSE),
Sample,
i.Sample)]
I have a problem with spatial data.
I need to extract temperature data from a NetCDF file; then I need to associate this temperature at given latitude and longitude to another set of latitude and longitude contained in a different dataframe.
This is the code I used to extract my variables:
library(ncdf4) # provides nc_open() and ncvar_get()
myfile <- nc_open(paste(wd, 'myfile.nc', sep=''))
timearr = ncvar_get(myfile, "time")
temp <- ncvar_get(myfile, 'temp_srf')
lat <- ncvar_get(myfile, 'lat_rho')
lon <- ncvar_get(myfile, 'lon_rho')
dim(temp)
[1] 27 75 52 # which means: 27 longitude * 75 latitudes * 52 time steps
I chose to work on the first time step of temperature for now. So:
> t1 <- as.vector(temp[,,1])
Then I created a data.frame including lat, lon and temperature in the first time step:
lat1 <- as.vector(lat)
lon1 <- as.vector(lon)
df1 <- as.data.frame(cbind(lon1, lat1, t1))
head(df1)
lon1 lat1 t1
1 18.15338 40.48656 13.96225
2 18.24083 40.55126 14.36726
3 18.32845 40.61589 14.53822
4 18.41627 40.68045 14.78643
5 18.50427 40.74495 14.88624
6 18.59246 40.80938 14.95925
In another data frame (df2) I have some random points of latitude and longitude, that I have to associate to the closest latitude and longitude of the previous data.frame:
> df2 <- read.csv(paste(id, "myfile.csv", sep=""), header=TRUE, sep=",")
> head(df2)
LONs LATs
1 14.13189 43.41072
2 14.13342 43.34871
3 14.09980 43.40822
4 14.05338 43.72771
5 13.91311 43.88051
6 13.98500 43.91164
I was thinking to get the distance between each point and get the lowest one, but I don't know how to do it. Not sure if there are other solutions.
I am assuming your data are unprojected (longitude/latitude) coordinates, and that you need to calculate great circle distances. You can use a formula yourself (see my answer here), or you can use rdist.earth from the package fields. For each entry in df2, calculate the distance to all entries in df1, find the index of the minimum distance in that vector, and use that index to select the appropriate row of df1 from which to assign temp to df2. It only takes one line (but it might be clearer to separate the steps over a few commands):
require( fields )
df2["Temp"] <- df1[ sapply( seq_len( nrow(df2) ) , function(x){ which.min( rdist.earth( df2[x,] , as.matrix( df1[ c("lon1" , "lat1") ] ) , miles = FALSE, R = 6371 ) ) } ) , "t1" ]
And the results using your data:
df1
# lon1 lat1 t1
# 1 18.15338 40.48656 13.96225
# 2 18.24083 40.55126 14.36726
# 3 18.32845 40.61589 14.53822
# 4 18.41627 40.68045 14.78643
# 5 18.50427 40.74495 14.88624
# 6 18.59246 40.80938 14.95925
df2
# LONs LATs Temp
# 1 14.13189 43.41072 13.96225
# 2 14.13342 43.34871 13.96225
# 3 14.09980 43.40822 13.96225
# 4 14.05338 43.72771 14.53822
# 5 13.91311 43.88051 14.53822
# 6 13.98500 43.91164 14.78643
It looks like your points are at least a km apart (>300 km in this data), so you should get good accuracy with the Great Circle formula. If they are less than 1 km apart, you may want to use the Haversine formula.
Two formulas I like for getting the distance between two lat/long coordinates are the Haversine formula and Vincenty's formula. The Haversine formula is a simpler formula that assumes Earth is a perfect sphere. You will probably get accuracy to a few feet. If you need a higher level of accuracy, try Vincenty's formula. It's spheroid based which attempts to account for Earth's imperfect sphere shape. The samples on the links aren't in R but it shouldn't be difficult to rewrite them in R.
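As a rough sketch of what an R version of the Haversine formula could look like (the function name haversine_km, the mean Earth radius of 6371 km and the small example grid are assumptions for illustration):

```r
# Great-circle distance in km between points given in decimal degrees.
# Vectorised: lon2/lat2 may be vectors, so one call handles a whole grid.
haversine_km <- function(lon1, lat1, lon2, lat2, R = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin guards against rounding pushing sqrt(a) slightly above 1
  2 * R * asin(pmin(1, sqrt(a)))
}

# Hypothetical grid of temperatures and one query point
grid <- data.frame(lon  = c(18.15, 18.24, 18.33),
                   lat  = c(40.49, 40.55, 40.62),
                   temp = c(13.96, 14.37, 14.54))
pt <- c(lon = 18.25, lat = 40.56)

# Index of the closest grid point, then its temperature
nearest <- which.min(haversine_km(pt["lon"], pt["lat"], grid$lon, grid$lat))
grid$temp[nearest]
```

Because the function is vectorised over the grid, looping over the query points with sapply gives the nearest grid cell for each of them without building the full distance matrix at once.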
The attached plot (Manhattan plot) contains on the x axis chromosome positions from the genome and on the Y axis -log(p), where p is a p-value associated with the points (variants) from that specific position.
I have used the following R code to generate it (from the gap package) :
require(gap)
affy <-c(40220, 41400, 33801, 32334, 32056, 31470, 25835, 27457, 22864, 28501, 26273,
24954, 19188, 15721, 14356, 15309, 11281, 14881, 6399, 12400, 7125, 6207)
CM <- cumsum(affy)
n.markers <- sum(affy)
n.chr <- length(affy)
test <- data.frame(chr=rep(1:n.chr,affy),pos=1:n.markers,p=runif(n.markers))
oldpar <- par()
par(cex=0.6)
colors <- c("red","blue","green","cyan","yellow","gray","magenta","red","blue","green", "cyan","yellow","gray","magenta","red","blue","green","cyan","yellow","gray","magenta","red")
mhtplot(test,control=mht.control(colors=colors),pch=19,bg=colors)
> head(test)
chr pos p
1 1 1 0.79296584
2 1 2 0.96675136
3 1 3 0.43870076
4 1 4 0.79825513
5 1 5 0.87554143
6 1 6 0.01207523
I am interested in getting the coordinates of the peaks of the plot above a certain threshold (-log(p)) .
If you want the indices of the values above the 99th percentile:
# Add new column with log values
test = transform(test, log_p = -log10(test[["p"]]))
# Get the 99th percentile
pct99 = quantile(test[["log_p"]], 0.99)
...and get the values from the original data test:
peaks = test[test[["log_p"]] > pct99,]
> head(peaks)
chr pos p log_p
5 1 5 0.002798126 2.553133
135 1 135 0.003077302 2.511830
211 1 211 0.003174833 2.498279
586 1 586 0.005766859 2.239061
598 1 598 0.008864987 2.052322
790 1 790 0.001284629 2.891222
You can use this with any threshold. Note that I have not calculated the first derivative, see this question for some pointers:
How to calculate first derivative of time series
After calculating the first derivative, you can find the peaks by looking at points in the time series where the first derivative is (almost) zero. After identifying these peaks, you can check which ones are above the threshold.
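A minimal sketch of that idea for discrete data, where the "derivative" is approximated by first differences and a peak is a point at which the slope turns from rising to falling. The vector y and the threshold are made-up values standing in for -log10(p):

```r
# Hypothetical -log10(p) values along one chromosome
y <- c(0.2, 1.5, 3.1, 2.0, 0.4, 2.6, 2.9, 2.7, 0.1)
threshold <- 2.5

# Sign of the first difference: +1 rising, -1 falling, 0 flat
slope_sign <- sign(diff(y))

# A local maximum sits where the slope sign drops (rising then falling);
# + 1 converts the position in diff() back to an index of y
peak_idx <- which(diff(slope_sign) < 0) + 1

# Keep only the peaks above the threshold
peak_idx <- peak_idx[y[peak_idx] > threshold]

data.frame(pos = peak_idx, log_p = y[peak_idx])
```

On noisy data nearly every other point can be a local maximum, so smoothing first or requiring a minimum spacing between peaks is usually needed before this is useful on a real Manhattan plot.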
Based on my experience, after plotting the graph you can use the following R code to find the peak coordinates:
plot(x[,1], x[,2])
identify(x[,1], x[,2], labels=row.names(x))
Note that here x[,1] refers to the x coordinate (the genome coordinate) and x[,2] is your -log10(p) value.
At this point, use your mouse to select a point and hit Enter, which will give you the peak location. Then type the following code to get the coordinates:
coords <- locator(type="l")
coords