I have two large dataframes called intersections (representing intersections of a street system) and users (representing users of a network) as follows:
intersections has three columns: x, y and label_street. They represent the position of an intersection in a square observation window (say [0,5] x [0,5]) and the street it is located on. Here is an example:
intersections <- data.frame(x=c(0.147674, 0.235356, 0.095337, 0.147674), y=c(0.132956, 0.150813, 0.087345, 0.132956), label_street = c(5,6,5,6))
head(intersections)
x y label_street
1 0.147674 0.132956 5
2 0.235356 0.150813 6
3 0.095337 0.087345 5
4 0.147674 0.132956 6
Since an intersection is located at the crossing of several streets, every (x,y) combination in the intersections table appears at least twice, but with different label_street values (e.g. rows 1 and 4 in the previous example). The label_street is not necessarily the row number (which is why it starts at 5 in my example).
users has four columns: x, y, label_street and ID. They represent the position of a user, the street it is located on and a unique ID per user. There are no duplicates in this dataframe, as a user is located on a unique street and has a unique ID. Here is an example (again, the ID and the label_street are not necessarily the row number):
users <- data.frame(x = c(0.20428152, 0.17840619, 0.12964668, 0.20423856, 0.19349761, 0.10861251), y = c(0.14448448, 0.13921481, 0.11724543, 0.14447573, 0.14228827, 0.09891443), label_street = c(6,6,5,6,6,5), ID = c(2703, 3460, 4325, 12506, 19753, 21282))
head(users)
x y label_street ID
1 0.20428152 0.14448448 6 2703
2 0.17840619 0.13921481 6 3460
3 0.12964668 0.11724543 5 4325
4 0.20423856 0.14447573 6 12506
5 0.19349761 0.14228827 6 19753
6 0.10861251 0.09891443 5 21282
What I want to do is the following: for each point (x,y) of intersections, get the ID of, and the distance to, its closest neighbour sharing the same label_street in users.
I have a working solution using the spatstat function nncross for the nearest neighbour search and the plyr function adply for working on the data.
My working solution is as follows:
1) Write a user-defined function which gets the ID and the distance to the nearest neighbour of a row in a query table
library(spatstat) # for as.ppp() and nncross()
NN <- function(row, query){
  df <- row
  window <- c(0, 5, 0, 5) # needed to convert to ppp objects and compute NN distances with nncross
  NN <- nncross(as.ppp(row[, 1:2], window), as.ppp(query[, 1:2], window))
  df$NN.ID <- query$ID[NN$which]
  df$dist <- NN$dist
  return(df)
}
2) Apply this user-defined function row-wise to my dataframe "intersections", with the query being the subset of users sharing the same label_street as the row:
result <- adply(intersections, 1, function(row) NN(row, users[users$label_street == row$label_street, ]))
The result is as follows on the example:
head(result)
x y label_street NN.ID NN.dist
1 0.147674 0.132956 5 4325 0.02391247
2 0.235356 0.150813 6 2703 0.03171236
3 0.095337 0.087345 5 21282 0.01760940
4 0.147674 0.132956 6 3460 0.03136304
Since my real dataframes will be huge, I think computing full distance matrices to find the nearest neighbours won't be efficient, and that adply will be slow.
Does anyone have an idea of a data.table-like solution? I only know the basics of data.table and have always found it very efficient compared to plyr.
This solution uses the RANN package to find nearest neighbours. The trick is to first ensure that elements with different label_street values are always farther apart than any two elements sharing the same label_street. We do this by adding an extra numeric column whose value is constant within a label_street but differs by a very large amount between different values of label_street. Putting it together:
intersections <- data.frame(x=c(0.147674, 0.235356, 0.095337, 0.147674), y=c(0.132956, 0.150813, 0.087345, 0.132956), label_street = c(5,6,5,6))
users <- data.frame(x = c(0.20428152, 0.17840619, 0.12964668, 0.20423856, 0.19349761, 0.10861251), y = c(0.14448448, 0.13921481, 0.11724543, 0.14447573, 0.14228827, 0.09891443), label_street = c(6,6,5,6,6,5), number = c(2703, 3460, 4325, 12506, 19753, 21282))
# add a numeric column that is constant within each category and has a very large value
intersections$label_street_large <- intersections$label_street * 1e6
users$label_street_large <- users$label_street * 1e6
# call the nearest neighbour function (k = 1 neighbour); the searched data are the
# users, the query points are the intersections
nearest_neighbours <- RANN::nn2(
  users[, c("x", "y", "label_street_large")],
  intersections[, c("x", "y", "label_street_large")],
  k = 1
)
# get original IDs and distances
IDs <- users$number[c(nearest_neighbours$nn.idx)]
distances <- c(nearest_neighbours$nn.dists)
IDs
# [1]  4325  2703 21282  3460
distances
# [1] 0.02391247 0.03171236 0.01760940 0.03136304
I hope this helps you. It should be very fast because it calls nn2 only once, and nn2 runs in O(N log N) time.
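As for the data.table route asked about in the question, here is a possible sketch of my own (not part of the answer above): join users to intersections on label_street and take the grouped minimum of the Euclidean distance. Note that the answer's users table stores the IDs in a column called number, and that this join materialises every same-street intersection/user pair, so the single nn2 call above should still scale better on really large data.
library(data.table)
int_dt  <- as.data.table(intersections)
user_dt <- as.data.table(users)
# join every intersection to all users on the same street; the user coordinates
# keep the names x/y, the intersection coordinates become i.x/i.y
joined <- user_dt[int_dt, on = "label_street", allow.cartesian = TRUE]
joined[, dist := sqrt((x - i.x)^2 + (y - i.y)^2)]
# keep the closest user per intersection
result_dt <- joined[, .(NN.ID = number[which.min(dist)], NN.dist = min(dist)),
                    by = .(x = i.x, y = i.y, label_street)]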
Related
There are five polygons for five different cities (see the attached file in the link; it's called bound.shp). I also have a point file "points.csv" with longitude and latitude, where for each point I know the proportion of people belonging to group m and group h.
I am trying to calculate the spatial segregation proposed by Reardon and O’Sullivan, “Measures of Spatial Segregation”.
There is a package called "seg" which should allow us to do this. I have been trying, but so far with no success.
Here is the link to the example file: LINK. After downloading the "example", this is what I do:
setwd("~/example")
library(seg)
library(sf)
bound <- st_read("bound.shp")
points <- st_read("points.csv", options=c("X_POSSIBLE_NAMES=x","Y_POSSIBLE_NAMES=y"))
#I apply the following formula
seg::spseg(bound, points[ ,c(group_m, group_h)] , smoothing = "kernel", sigma = bandwidth)
Error: 'x' must be a numeric matrix with two columns
Can someone help me solve this issue? Or is there an alternate method which I can use?
Thanks a lot.
I don't know exactly what the spseg function does, but from the spseg documentation in the seg package:
The first argument x should be a dataframe or an object of class Spatial.
The second argument data should be a matrix or dataframe.
Looking at the examples for spseg, note that data should have as many rows as there are IDs in the Spatial object. In your sample, the IDs are the cities, each with its own polygon.
First, let's examine the bound data:
setwd("~/example")
library(seg)
library(sf)
#For the fortify function
library(ggplot2)
bound <- st_read("bound.shp")
bound <- as_Spatial(bound)
class(bound)
"SpatialPolygonsDataFrame"
attr(,"package")
"sp"
tail(fortify(bound))
Regions defined for each Polygons
long lat order hole piece id group
5379 83.99410 27.17326 972 FALSE 1 5 5.1
5380 83.99583 27.17339 973 FALSE 1 5 5.1
5381 83.99705 27.17430 974 FALSE 1 5 5.1
5382 83.99792 27.17552 975 FALSE 1 5 5.1
5383 83.99810 27.17690 976 FALSE 1 5 5.1
5384 83.99812 27.17700 977 FALSE 1 5 5.1
So you have 5 IDs in your SpatialPolygonsDataFrame. Now, let's read points.csv with the read.csv function, since the data needs to be in matrix form for the spseg function.
points <- read.csv("points.csv")
tail(points)
group_m group_h x y
950 4.95 78.49000 84.32887 26.81203
951 5.30 86.22167 84.27448 26.76932
952 8.68 77.85333 84.33353 26.80942
953 7.75 82.34000 84.35270 26.82850
954 7.75 82.34000 84.35270 26.82850
955 7.75 82.34000 84.35270 26.82850
The documentation and the example within state that the number of rows of the data (which has two attributes, group_m and group_h here) should equal the number of IDs, i.e. the number of cities. One option is to compute the mean (or any other statistic) of each group per polygon, so that you end up with exactly one value per city; a possible sketch is shown just below.
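A possible sketch of that aggregation (untested; it assumes the polygon id column in bound.shp is called city, as used in the other answer below): average the two group columns within each polygon and feed exactly one row per polygon to spseg.
bound_sf <- st_as_sf(bound)                         # back to sf for the spatial join
pts <- st_as_sf(points, coords = c("x", "y"), crs = st_crs(bound_sf))
pts <- st_join(pts, bound_sf)                       # attach the polygon id to each point
# one mean value of each group per city (rows assumed to come out in polygon order)
agg <- aggregate(st_drop_geometry(pts)[, c("group_m", "group_h")],
                 by = list(city = pts$city), FUN = mean)
sample_spseg_mean <- spseg(bound, as.matrix(agg[, c("group_m", "group_h")]))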
On the other hand, just to show that the function works properly once it is fed a matrix with 5 rows and 2 groups:
sample_spseg <- spseg(bound, as.matrix(points[1:5,c("group_m", "group_h")]))
print(sample_spseg)
Reardon and O'Sullivan's spatial segregation measures
Dissimilarity (D) : 0.0209283
Relative diversity (R): -0.008781
Information theory (H): -0.0066197
Exposure/Isolation (P):
group_m group_h
group_m 0.07577679 0.9242232
group_h 0.07516285 0.9248372
--
The exposure/isolation matrix should be read horizontally.
Read 'help(spseg)' for more details.
First: I do not have experience with the seg package and its functions.
What I read from your question is that you want to run the spseg function on the points within each area?
If so, here is a possible approach:
library(sf)
library(tidyverse)
library(seg)
library(mapview) # for quick viewing only
# read polygons, make valid to avoid problems later on
areas <- st_read("./temp/example/bound.shp") %>%
sf::st_make_valid()
# read points and convert to sf object
points <- read.csv("./temp/example/points.csv") %>%
sf::st_as_sf(coords = c("x", "y"), crs = 4326) %>%
# spatial join: attach the city each point falls in
sf::st_join(areas)
# what do we have so far??
mapview::mapview(points, zcol = "city")
# get the coordinates back into a data.frame
mydata <- cbind(points, st_coordinates(points))
# drop the geometry, we do not need it anymore
st_geometry(mydata) <- NULL
# looks like...
head(mydata)
# group_m group_h city X Y
# 1 8.02 84.51 2 84.02780 27.31180
# 2 8.02 84.51 2 84.02780 27.31180
# 3 8.02 84.51 2 84.02780 27.31180
# 4 5.01 84.96 2 84.04308 27.27651
# 5 5.01 84.96 2 84.04622 27.27152
# 6 5.01 84.96 2 84.04622 27.27152
# Split to a list by city
L <- split(mydata, mydata$city)
# loop over list and perform sppseg function
final <- lapply(L, function(i) spseg(x = i[, 4:5], data = i[, 1:2]))
# test for the first city
final[[1]]
# Reardon and O'Sullivan's spatial segregation measures
#
# Dissimilarity (D) : 0.0063
# Relative diversity (R): -0.0088
# Information theory (H): -0.0067
# Exposure/Isolation (P):
# group_m group_h
# group_m 0.1160976 0.8839024
# group_h 0.1157357 0.8842643
# --
# The exposure/isolation matrix should be read horizontally.
# Read 'help(spseg)' for more details.
spplot(final[[1]], main = "Equal")
I have a df called 'covs' with sites in rows and, in columns, 9 different environmental variables for each of these sites. I need to recalculate the value of each cell as (x - center_values(x)) / scale_values(x). However, 'center_values' and 'scale_values' are different for each environmental covariate, and they are stored in another df called 'correction'.
I have found many solutions for applying a function to a whole df, but not for applying covariate-specific values looked up by the name of the column being transformed.
covs <- read.table(text = "X elev builtup river grip pa npp treecov
384879-2009 1 24.379101 25188.572 1241.8348 1431.1082 5.705152e+03 16536.664 60.23175
385822-2009 2 29.533478 32821.770 2748.9053 1361.7772 2.358533e+03 15773.115 62.38455
385823-2009 3 30.097059 28358.244 2525.7627 1073.8772 4.340906e+03 14899.451 46.03269
386765-2009 4 33.877861 40557.891 927.4295 1049.4838 4.580944e+03 15362.518 53.08151
386766-2009 5 38.605156 36182.801 1479.6178 1056.2130 2.517869e+03 13389.958 35.71379",
header= TRUE)
correction <- read.table(text = "var_name center_values scale_values
1 X 196.5 113.304898393671
2 elev 200.217889868483 307.718211316278
3 builtup 31624.4888660664 23553.2438790344
4 river 1390.41023742909 1549.88661649406
5 grip 5972.67361738244 6996.57793554527
6 pa 2731.33431010861 4504.71055521749
7 npp 10205.2997576655 2913.19658598938
8 treecov 47.9080656134352 17.7101565911347
9 nonveg 7.96755640452006 4.56625351682905", header= TRUE)
Could someone help me write code to recalculate the environmental covariate values in 'covs' using the specific values reported in 'correction'? E.g. from each value in the column 'elev' of the df 'covs', I need to subtract the 'center_values' reported for 'elev' in the 'correction' df, and then divide by the 'scale_values' of 'elev' reported in the 'correction' df. Thank you for your kind help.
You may assign var_name to row names, then loop over the names of covs to do the calculations in an sapply.
rownames(correction) <- correction$var_name
res <- as.data.frame(sapply(names(covs), function(x)
  (covs[, x] - correction[x, "center_values"]) / correction[x, "scale_values"]))
res
# X elev builtup river grip pa npp treecov
# 1 -1.725433 -0.5714280 -0.27324970 -0.09586213 -0.6491124 0.66015733 2.173339 0.6958541
# 2 -1.716607 -0.5546776 0.05083296 0.87651254 -0.6590217 -0.08275811 1.911239 0.8174114
# 3 -1.707781 -0.5528462 -0.13867495 0.73253905 -0.7001703 0.35730857 1.611340 -0.1058927
# 4 -1.698956 -0.5405596 0.37928543 -0.29871910 -0.7036568 0.41059457 1.770295 0.2921174
# 5 -1.690130 -0.5251972 0.19353224 0.05755748 -0.7026950 -0.04738713 1.093183 -0.6885470
Check e.g. "elev":
(covs[,"elev"] - correction["elev", "center_values"]) / correction["elev", "scale_values"]
# [1] -0.5714280 -0.5546776 -0.5528462 -0.5405596 -0.5251972
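For this particular centre-and-scale operation, an equivalent base R alternative (a sketch, assuming the row names of correction have been set as above) is scale(), which applies exactly (x - center) / scale column-wise when given vectors aligned with the columns of covs:
res2 <- as.data.frame(scale(as.matrix(covs),
                            center = correction[names(covs), "center_values"],
                            scale  = correction[names(covs), "scale_values"]))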
I have a problem with spatial data.
I need to extract temperature data from a NetCDF file; then I need to associate this temperature at given latitude and longitude to another set of latitude and longitude contained in a different dataframe.
This is the code I used to extract my variables:
library(ncdf4)
myfile <- nc_open(paste(wd, 'myfile.nc', sep=''))
timearr = ncvar_get(myfile, "time")
temp <- ncvar_get(myfile, 'temp_srf')
lat <- ncvar_get(myfile, 'lat_rho')
lon <- ncvar_get(myfile, 'lon_rho')
dim(temp)
[1] 27 75 52 # which means: 27 longitudes * 75 latitudes * 52 time steps
I chose to work on the first time step of temperature for now. So:
> t1 <- as.vector(temp[,,1])
Then I created a data.frame including lat, lon and temperature in the first time step:
lat1 <- as.vector(lat)
lon1 <- as.vector(lon)
df1 <- as.data.frame(cbind(lon1, lat1, t1))
head(df1)
lon1 lat1 t1
1 18.15338 40.48656 13.96225
2 18.24083 40.55126 14.36726
3 18.32845 40.61589 14.53822
4 18.41627 40.68045 14.78643
5 18.50427 40.74495 14.88624
6 18.59246 40.80938 14.95925
In another data frame (df2) I have some random points of latitude and longitude, that I have to associate to the closest latitude and longitude of the previous data.frame:
> df2 <- read.csv(paste(id, "myfile.csv", sep=""), header=TRUE, sep=",")
> head(df2)
LONs LATs
1 14.13189 43.41072
2 14.13342 43.34871
3 14.09980 43.40822
4 14.05338 43.72771
5 13.91311 43.88051
6 13.98500 43.91164
I was thinking of computing the distance between each pair of points and taking the smallest one, but I don't know how to do it, and I'm not sure whether there are better solutions.
I am assuming your data are unprojected longitude/latitude coordinates, and that you need to calculate great circle distances. You can use a formula yourself (see my answer here), or you can use rdist.earth from the fields package. For each entry in df2, calculate the distance to all entries in df1, find the index of the minimum distance in that vector, and use that index to select the appropriate row of df1 whose temperature is assigned to df2. It only takes one line (but it might be clearer to separate the steps over a few commands):
require(fields)
df2["Temp"] <- df1[sapply(seq_len(nrow(df2)), function(x) {
  which.min(rdist.earth(df2[x, ], as.matrix(df1[c("lon1", "lat1")]),
                        miles = FALSE, R = 6371))
}), "t1"]
And the results using your data:
df1
# lon1 lat1 t1
# 1 18.15338 40.48656 13.96225
# 2 18.24083 40.55126 14.36726
# 3 18.32845 40.61589 14.53822
# 4 18.41627 40.68045 14.78643
# 5 18.50427 40.74495 14.88624
# 6 18.59246 40.80938 14.95925
df2
# LONs LATs Temp
# 1 14.13189 43.41072 13.96225
# 2 14.13342 43.34871 13.96225
# 3 14.09980 43.40822 13.96225
# 4 14.05338 43.72771 14.53822
# 5 13.91311 43.88051 14.53822
# 6 13.98500 43.91164 14.78643
It looks like your points are at least a km apart (>300 km in this data), so you should get good accuracy with the great circle formula. If they are closer than 1 km you may want to use the Haversine formula instead.
Two formulas I like for getting the distance between two lat/long coordinates are the Haversine formula and Vincenty's formula. The Haversine formula is the simpler of the two and assumes Earth is a perfect sphere; you will probably get accuracy to within a few feet. If you need higher accuracy, try Vincenty's formula: it is spheroid-based and attempts to account for Earth's imperfectly spherical shape. The samples on the links aren't in R, but it shouldn't be difficult to rewrite them in R.
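For reference, a minimal Haversine implementation in R (my own sketch; it assumes coordinates in decimal degrees and returns kilometres):
haversine <- function(lon1, lat1, lon2, lat2, R = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
       cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * R * asin(sqrt(a))  # great-circle distance in km (R = mean Earth radius)
}
haversine(18.15338, 40.48656, 14.13189, 43.41072)  # e.g. a df1 point to a df2 point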
The attached plot (Manhattan plot) contains on the x axis chromosome positions from the genome and on the Y axis -log(p), where p is a p-value associated with the points (variants) from that specific position.
I have used the following R code to generate it (from the gap package) :
require(gap)
affy <-c(40220, 41400, 33801, 32334, 32056, 31470, 25835, 27457, 22864, 28501, 26273,
24954, 19188, 15721, 14356, 15309, 11281, 14881, 6399, 12400, 7125, 6207)
CM <- cumsum(affy)
n.markers <- sum(affy)
n.chr <- length(affy)
test <- data.frame(chr=rep(1:n.chr,affy),pos=1:n.markers,p=runif(n.markers))
oldpar <- par()
par(cex=0.6)
colors <- c("red","blue","green","cyan","yellow","gray","magenta","red","blue","green", "cyan","yellow","gray","magenta","red","blue","green","cyan","yellow","gray","magenta","red")
mhtplot(test,control=mht.control(colors=colors),pch=19,bg=colors)
> head(test)
chr pos p
1 1 1 0.79296584
2 1 2 0.96675136
3 1 3 0.43870076
4 1 4 0.79825513
5 1 5 0.87554143
6 1 6 0.01207523
I am interested in getting the coordinates of the peaks of the plot above a certain -log(p) threshold.
If you want the indices of the values above the 99th percentile:
# Add new column with log values
test = transform(test, log_p = -log10(test[["p"]]))
# Get the 99th percentile
pct99 = quantile(test[["log_p"]], 0.99)
...and get the values from the original data test:
peaks = test[test[["log_p"]] > pct99,]
> head(peaks)
chr pos p log_p
5 1 5 0.002798126 2.553133
135 1 135 0.003077302 2.511830
211 1 211 0.003174833 2.498279
586 1 586 0.005766859 2.239061
598 1 598 0.008864987 2.052322
790 1 790 0.001284629 2.891222
You can use this with any threshold. Note that I have not calculated the first derivative; see this question for some pointers:
How to calculate first derivative of time series
After calculating the first derivative, you can find the peaks by looking at points in the time series where the first derivative is (almost) zero. After identifying these peaks, you can check which ones are above the threshold.
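A minimal sketch of that idea on the test data (my own, not from the gap package): treat a point as a local peak when the first difference of -log10(p) changes sign from positive to negative, then keep only the peaks above the threshold.
log_p <- -log10(test$p)
threshold <- quantile(log_p, 0.99)
# a sign change of the first difference marks a local maximum
d <- diff(log_p)
is_peak <- c(FALSE, d[-length(d)] > 0 & d[-1] < 0, FALSE)
peaks_deriv <- test[is_peak & log_p > threshold, ]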
Based on my experience, after plotting the graph you can use the following R code to find the peak coordinates:
plot(x[,1], x[,2])
identify(x[,1], x[,2], labels=row.names(x))
Note that x[,1] refers to the x coordinate (the genome position) and x[,2] to your -log10(p) value.
Now use your mouse to select a point and hit Enter, which will give you the peak location. Then type the following code to get the coordinates:
coords <- locator(type="l")
coords
I have 2 sets of points, set1 and set2. Both sets of points have a date associated with each point. Points in set1 are "ephemeral" and only exist on the given date. Points in set2 are "permanent": they are constructed at a given date and then exist forever after that date.
set.seed(1)
dates <- seq(as.Date('2011-01-01'),as.Date('2011-12-31'),by='days')
set1 <- data.frame(lat=40+runif(10000),
lon=-70+runif(10000),date=sample(dates,10000,replace=TRUE))
set2 <- data.frame(lat=40+runif(100),
lon=-70+runif(100),date=sample(dates,100,replace=TRUE))
Here's my problem: for each point in set1 (ephemeral), find the distance to the closest point in set2 (permanent) that was constructed BEFORE the event in set1 occurred. For example, the 1st point in set1 occurred on 2011-03-18:
> set1[1,]
lat lon date
1 40.26551 -69.93529 2011-03-18
So I want to find the closest point in set2 that was constructed before 2011-03-18:
> head(set2[set2$date<=as.Date('2011-04-08'),])
lat lon date
1 40.41531 -69.25765 2011-02-18
7 40.24690 -69.29812 2011-02-19
13 40.10250 -69.52515 2011-02-12
14 40.53675 -69.28134 2011-02-27
17 40.66236 -69.07396 2011-02-17
20 40.67351 -69.88217 2011-01-04
The additional wrinkle is that these are latitude/longitude points, so I have to calculate distances along the surface of the earth. The R package fields provides a convenient function to do this:
require(fields)
distMatrix <- rdist.earth(set1[,c('lon','lat')],
set2[,c('lon','lat')], miles = TRUE)
My question is: how can I adjust the distances in this matrix to Inf if the point in set2 (column of the distance matrix) was constructed after the point in set1 (row of the distance matrix)?
Here is what I would do:
# TRUE where the set1 event is on or before the set2 construction date,
# i.e. the set2 point is not allowed as a neighbour
earlierMatrix <- outer(set1$date, set2$date, "<=")
# push the disallowed pairs to Inf so they can never be the minimum
distMatrix2 <- distMatrix + ifelse(earlierMatrix, Inf, 0)
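A possible follow-up to this (my sketch, not part of the answer): once the disallowed pairs are Inf, the nearest usable set2 point for each set1 row is just a row-wise minimum.
set1$dist     <- apply(distMatrix2, 1, min)        # Inf if no set2 point existed yet
set1$objectID <- apply(distMatrix2, 1, which.min)  # column index of the closest set2 point
set1$objectID[!is.finite(set1$dist)] <- NA         # no valid neighbour for that row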
Here's my attempt at an answer. It's not particularly efficient, but I think it is correct. It also allows you to easily sub in different distance calculators:
#Calculate distances
require(fields)
distMatrix <- lapply(1:nrow(set1),function(x) {
#Find distances to all points
distances <- rdist.earth(set1[x,c('lon','lat')], set2[,c('lon','lat')], miles = TRUE)
#Set distance to Inf if the set1 point occurred BEFORE the set2 dates
distances <- ifelse(set1[x,'date']<set2[,'date'], Inf, distances)
return(distances)
})
distMatrix <- do.call(rbind,distMatrix)
#Find distance to closest object
set1$dist <- apply(distMatrix,1,min)
#Find id of closest object
objectID <- lapply(1:nrow(set1),function(x) {
if (set1[x,'dist']<Inf) {
IDs <- which(set1[x,'dist']==distMatrix[x,])
} else {
IDs <- NA
}
return(sample(IDs,1)) #Randomly break ties (if there are any)
})
set1$objectID <- do.call(rbind,objectID)
Here's the head of the resulting dataset:
> head(set1)
lat lon date dist objectID
1 40.26551 -69.93529 2011-03-18 3.215514 13
2 40.37212 -69.32339 2011-02-11 10.320910 46
3 40.57285 -69.26463 2011-02-23 3.954132 4
4 40.90821 -69.88870 2011-04-24 4.132536 49
5 40.20168 -69.95335 2011-02-24 4.284692 45
6 40.89839 -69.86909 2011-07-12 3.385769 57