I am analysing STATS19 road accident data, commendably made available to the public by the UK government. I would like to look at how clustered different types of accidents are. The "G function" (described here) can be used to measure the divergence of point patterns from cases of complete spatial randomness "CSR".
spatstat handles this kind of problem well, with the envelope function providing a visualisation for the extent to which the pattern diverges from the CSR for different distances.
As my colleague Dan Olner has pointed out, however, the results (shown below, showing great divergence from the CSR) do not necessarily show clustering - it could be simply that we are detecting the natural clustering of the road network, on which most road accidents occur. The plot below can be reproduced by cloning my GitHub repo and running the following (after running parts of WY.R):
r <- seq(0, sqrt(2)/6, by = 0.005)
acB1 <- elide(acB, scale = TRUE)
# acB1 <- acB1[1:50,] # for tiny subset
acB1 <- SpatialPoints(acB1)
# Calculate the G function for the points
envacB <- envelope(as(acB1, "ppp"), fun = Gest)
# Calculate the G function for the points
plot(envacB)
This issue is actually described by Adrian Baddeley (developer of spatstat) himself in the package's documentation:
points could be locations in one dimension (such as
road accidents recorded on a road network)
This is exactly the situation I am facing but I do not know how to modify the analysis presented above to constraint the CSR to (or better, near to - as not all accidents are precisely on the road - see below) the road network. (see data here).
One suggestion was to take random points from the road network and calculate the G function for this and compare it with my accident data, but that would not create a clear (statistically significant) bounding box. Any suggestions?
You are absolutely right that the perceived clustering could be due to the accidents occurring on the road network. This must be accounted for.
In spatstat the road network is represented by a "linnet" object, so you need to convert your road network to this format. I don't know the details of that, but I would guess you should look at the "shapefiles" vignette in spatstat (you might have to go through the line segment class "psp" to import things):
vignette("shapefiles", package="spatstat")
A point pattern on a linear network is of class "lpp", so this is the data format you need in the end. If you have managed to store your network as the linnet object "mynet" you should be able to do something like:
X <- as(acB1, "ppp")
X <- lpp(X, mynet)
This automatically projects your points onto the network. Now you can look at summary statistics on the network. I don't think the G function is implemented in this setup, but I know the K-function is (function "linearK"), so you could use that. The generic function envelope as you used in your code now calls envelope.lpp which makes sure that the CSR simulations also are generated on the network.
I hope some of this is useful albeit not very detailed. Have a look at the relevant help files in spatstat for more details:
help(lpp)
help(linnet)
help(linearK)
Do report back how you progress from here, then I (or more likely Adrian Baddeley) might be able to give you some more pointers.
Related
I recently dabbled a bit into point pattern analysis and wonder if there is any standard practice to create mark correlation structures varying with the inter-point distance of point locations. Clearly, I understand how to simulate independent marks, as it is frequently mentioned e.g.
library(spatstat)
data(finpines)
set.seed(0907)
marks(finpines) <- rnorm(npoints(finpines), 30, 5)
plot(finpines)
enter image description here
More generally speaking, assume we have a fair amount of points, say n=100 with coordinates x and y in an arbitrary observation window (e.g. rectangle). Every point carries a characteristic, for example the size of the point as a continuous variable. Also, we can examine every pairwise distance between the points. Is there a way to introduce correlation structure between the marks (of pairs of points) which depends on the inter-point distance between the point locations?
Furthermore, I am aware of the existence of mark analysing techniques like
fin <- markcorr(finpines, correction = "best")
plot(fin)
When it comes to interpretation, my lack of knowledge forces me to trust my colleagues (non-scientists). Besides, I looked at several references given in the documentation of the spatstat functions; especially, I had a look on "Statistical Analysis and Modelling of Spatial Point Patterns", p. 347, where inhibition and mutual stimulation as deviations from 1 (independence of marks) of the normalised mark correlation function are explained.
I think the best bet is to use a random field model conditional on your locations. Unfortunately the package RandomFields is not on CRAN at the moment, but hopefully it will return soon. I think it may be possible to install an old version of RandomFields from the archives if you want to get going immediately.
Thus, the procedure is:
Use spatstat to generate the random locations you like.
Extract the coordinates (either coords() or as.data.frame.ppp()).
Define a model in RandomFields (e.g. RMexp()).
Simulate the model on the given coordinates (RFsimulate()).
Convert back to a marked point pattern in spatstat (either ppp() or as.ppp()).
I'm working with a set of co-ordinates, and want to dynamically (I have many sets that need to go through this process) understand how many distinct groups there are within the data. My approach was to apply k-means to investigate whether it would find the centroids and I could go from there.
When plotting some data with 6 distinct clusters (visually) the k-means algorithm continues to ignore two significant clusters while putting many centroids into another.
See image below:
Red are the co-ordinate data points and blue are centroids that k-means has provided. In this specific case I've gone for 15 (arbitrary), but it still doesn't recognise those patches of data on the right hand side, rather putting a mid point between them while putting in 8 in the cluster in the top right.
Admittedly there are slightly more data points in the top right, but not by much.
I'm using the standard k-means algorithm in R and just feeding in x and y co-ordinates. I've tried standardising the data, but this doesn't make any difference.
Any thoughts on why this is, or other potential methodologies that could be applied to try and dynamically understand the number of distinct clusters there are in the data?
You could try with Self-organizing map:
this is a clustering algorithm based on Neural Networks which create a discretized representation of the input space of the training samples, called a map, and is, therefore, a method to do dimensionality reduction (SOM).
This algorithm is very good for clustering also because does not require a priori selection of the number of clusters (in k-mean you need to choose k, here no). In your case, it hopefully finds automatically the optimal number of cluster, and you can actually visualize it.
You can find a very nice python package called somoclu which has got this algorithm implemented and an easy way to visualize the result. Else you can go with R. Here you can find a blog post with a tutorial, and Cran package manual for SOM.
K-means is a randomized algorithm and it will get stuck in local minima.
Because of these problems, it is common to run k-means several times, and keep the result with least squares, I.e., the best of the local minima found.
is there a way to set the resolution parameter when using the function cluster_louvain to detect communities in igraph for R? It makes a lot of difference for the result, as this parameter is related to the hierarchical dissimilarity between nodes. Thank you.
The easiest way to do it is through the resolution package, available in this link https://github.com/analyxcompany/resolution
It is based on this paper http://arxiv.org/pdf/0812.1770.pdf
It pretty much has 2 functions cluster_resolution() and cluster_resolution_RandomOrderFULL().
In both you can state the resolution t and how many repetitions you want rep. And, you can just use the igraph object in the function.
cluster_resolution_RandomOrderFULL(g,t=0.5)
cluster_resolution_RandomOrderFULL(g,rep=20)
NOTE/EDIT: it will not accept signed networks though! I'm trying to either contact the owner of the code or costumize it myself to make it suitable for signed networks.
EDIT2: I was able to translate the function community_louvain.m from the Brain Connectivity Toolbox for Matlab to R.
Here is the github link for the signed_louvain()
you can pretty much just put for ex. signed_louvain(g, gamma = 1, mod = 'modularity')
it works with igraph or matrix objects as input. If it has negative values, you have to choose mod = 'neg_sym' or 'neg_asym'.
I recently started to work with a huge dataset, provided by medical emergency
service. I have cca 25.000 spatial points of incidents.
I am searching books and internet for quite some time and am getting more and more confused about what to do and how to do it.
The points are, of course, very clustered. I calculated K, L and G function
for it and they confirm serious clustering.
I also have population point dataset - one point for every citizen, that is similarly clustered as incidents dataset (incidents happen to people, so there is a strong link between these two datasets).
I want to compare these two datasets to figure out, if they are similarly
distributed. I want to know, if there are places, where there are more
incidents, compared to population. In other words, I want to use population dataset to explain intensity and then figure out if the incident dataset corresponds to that intensity. The assumption is, that incidents should appear randomly regarding to population.
I want to get a plot of the region with information where there are more or less incidents than expected if the incidents were randomly happening to people.
How would you do it with R?
Should I use Kest or Kinhom to calculate K function?
I read the description, but still don't understand what is a basic difference
between them.
I tried using Kcross, but as I figured out, one of two datasets used
should be CSR - completely spatial random.
I also found Kcross.inhom, should I use that one for my data?
How can I get a plot (image) of incident deviations regarding population?
I hope I asked clearly.
Thank you for your time to read my question and
even more thanks if you can answer any of my questions.
Best regards!
Jernej
I do not have time to answer all your questions in full, but here are some pointers.
DISCLAIMER: I am a coauthor of the spatstat package and the book Spatial Point Patterns: Methodology and Applications with R so I have a preference for using these (and I genuinely believe these are the best tools for your problem).
Conceptual issue: How big is your study region and does it make sense to treat the points as distributed everywhere in the region or are they confined to be on the road network?
For now I will assume we can assume they are distributed anywhere.
A simple approach would be to estimate the population density using density.ppp and then fit a Poisson model to the incidents with the population density as the intensity using ppm. This would probably be a reasonable null model and if that fits the data well you can basically say that incidents happen "completely at random in space when controlling for the uneven population density". More info density.ppp and ppm are in chapters 6 and 9 of 1, respectively, and of course in the spatstat help files.
If you use summary statistics like the K/L/G/F/J-functions you should always use the inhom versions to take the population density into account. This is covered in chapter 7 of 1.
Also it could probably be interesting to see the relative risk (relrisk) if you combine all your points in to a marked point pattern with two types (background and incidents). See chapter 14 of 1.
Unfortunately, only chapters 3, 7 and 9 of 1 are availble as free to download sample chapters, but I hope you have access to it at your library or have the option of buying it.
I have a river network in a shapefile (class: "SpatialLinesDataFrame"), with some points on it (see picture below).
I would like to compute the distances between points, but along the rivers. I have been searching a lot and I am not able to find any function that allows directly that.
The closest thing I have found is the function "networkdistance" in the package "secrlinear", however I don't manage to transform my shapefile into the format required to use the function (a "linearmask" object).
Any help with this would be extremely appreciated.
Thanks in advance,
Tina.
I know this is an old thread, but just in case someone runs across this in the future: I just released an R package (riverdist) that deals with this issue, and also provides some tools for network editing and data summaries & visualization. It was written with fisheries work in mind, but could probably be applied to what you're working on, or at least that's the hope!
https://cran.r-project.org/web/packages/riverdist/vignettes/riverdist_vignette.html
Sorry this wasn't more timely -
I think we resolved this problem offline: the geographic coordinates (lat/long) of the shapefile needed to be projected before they could be used in secrlinear. That package approximates the linear network and uses igraph functions for distances.