I am trying to run a mixed-effects model using the 'glmmTMB' package with a spatial covariance structure that accounts for distances between points on a sphere. I have dug into the source code and identified where I think the Euclidean distances for the spatial covariance structure are calculated. I know Euclidean distances are used based on this vignette:
https://cran.r-project.org/web/packages/glmmTMB/vignettes/covstruct.html
By bringing up the source code:
trace(getReStruc, edit = TRUE)
Line 44 is where dist(coords) is called to compute that distance matrix.
I want to change that code so that it calculates great-circle distances instead of Euclidean ones. However, functions such as distHaversine() from the 'geosphere' package expect two sets of points (the longitude and latitude of each), so I can't just plug in:
geosphere::distHaversine(coords)
Does anyone have a workaround for this? Any help would be really appreciated!
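For reference, here is the kind of substitution I have in mind. This is only a sketch, untested against glmmTMB's internals, and it assumes coords is a two-column matrix with longitude first and latitude second, which is the order geosphere expects:

library(geosphere)
# distm() applies a pairwise distance function to every pair of rows in a
# coordinate matrix, returning a full matrix of great-circle distances in metres
gc_mat  <- distm(coords, fun = distHaversine)
gc_dist <- as.dist(gc_mat)  # same class of object as the dist(coords) result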
I am currently trying to reproduce the plots on p. 120 of the textbook "Statistical Analysis and Modelling of Spatial Point Patterns". The following information should be sufficient to help me without looking at the textbook itself. Using the fantastic spatstat package, I am trying to simulate point patterns in the unit square resulting from inhomogeneous Poisson point processes (IPP) with the intensity functions (a) $\lambda(x,y) = a(x+y)$ (linear trend) and (b) $\lambda(r) = c \exp(-dr^2)$, with $r$ being the distance from the origin.
For (a) I did the following:
library(spatstat)
linear <- function(x, y, a) { a * (x + y) }  # intensity with a linear trend in x + y
plot(rpoispp(lambda = linear, a = 150))      # extra arguments are passed on to lambda
The resulting plot is not too bad, as far as I can tell. However, I cannot figure out how to implement (b) and would appreciate any help.
Hopefully, understanding how the implementation of (b) works will help me fit a model to an observed point pattern (with only a few clusters, probably one) that is likely to stem from an IPP, using ppm(pattern, function describing the simple model) or kppm.
Note: the reason I am asking this question is self-interest. I could easily retrieve the plots from the source, but that would not help me understand how to implement intensities, or how to create and fit simple models to observed point patterns.
If my question is answered elsewhere, I would appreciate links. Thank you!
If you want to code an intensity function as a function in the R language, then it should be a function of the spatial location (x,y).
In (b) the intensity function is $\lambda(x,y) = c \exp(-d(x^2 + y^2))$, where we use the fact that the distance from the origin $(0,0)$ to the point $(x,y)$ is $r = \sqrt{x^2 + y^2}$. The code is
lam <- function(x, y, c, d) { c * exp(-d * (x^2 + y^2)) }
In this example the value of lambda(x,y) depends only on the distance r, so we say loosely that "the intensity is a function of r", which may be the source of your confusion.
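For example, a quick simulation with illustrative parameter values (not taken from the textbook):

library(spatstat)
lam <- function(x, y, c, d) { c * exp(-d * (x^2 + y^2)) }
plot(rpoispp(lambda = lam, c = 150, d = 5))  # c and d are passed on to lam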
I want to implement DBSCAN in R on some GPS coordinates. I have a distance matrix (dis_matrix) that I fed into the following functions:
dbscan::dbscan(dis_matrix, eps=50, minPts = 5,borderPoints=TRUE)
fpc::dbscan(dis_matrix,eps = 50,MinPts = 5,method = "dist")
and I'm getting very different results from the two functions, both in the number of clusters and in whether a point is noise or belongs to a cluster. Basically, the results are inconsistent between the two implementations. I have no clue why they generate such different results, although here:
http://www.sthda.com/english/wiki/wiki.php?id_contents=7940
we see that both functions give the same result for the iris data.
My distance matrix comes from a function (geosphere::distm) which calculates the spatial distance between more than 2000 coordinates.
Furthermore, I coded DBSCAN myself according to this pseudo-code:
source: https://cse.buffalo.edu/~jing/cse601/fa13/materials/clustering_density.pdf
My results match what I obtained from the fpc package. Can anyone see why the two functions differ? I have already looked into both and haven't found anything.
The documentation of geosphere::distm says that it returns a matrix, not a dist object, while dbscan::dbscan assumes you have a data matrix and not distances. Convert your matrix into a dist object with as.dist first. This should resolve the problem.
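A minimal sketch of the fix, assuming dis_matrix is the full matrix returned by geosphere::distm (distances in metres):

d <- as.dist(dis_matrix)  # interpret the n-by-n matrix as pairwise distances
dbscan::dbscan(d, eps = 50, minPts = 5, borderPoints = TRUE)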
I was trying to use the dbscan package in R to cluster some spatial data. The dbscan::dbscan function takes eps and minPts as input. I have a data frame with two columns, longitude and latitude, expressed in decimal degrees, like the following:
df <- data.frame(lon = c(seq(1, 5, 1), seq(1, 5, 1)),
                 lat = c(1.1, 3.1, 1.2, 4.1, 2.1, 2.2, 3.2, 2.4, 1.4, 5.1))
and I apply the algorithm:
db <- fpc::dbscan(df, eps = 1, MinPts = 2)
Will eps here be defined in degrees, or in some other unit? I'm trying to understand in which unit this maximum-distance eps value is expressed, so any help is appreciated.
Never use the fpc package, always use dbscan::dbscan instead.
If you have latitude and longitude, you need to choose an appropriate distance function such as Haversine.
The default distance function, Euclidean, ignores the spherical nature of the Earth. The eps value is then a mixture of degrees latitude and longitude, but these do not correspond to uniform distances: one degree east at the equator is much farther than one degree east in Vancouver.
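To put rough numbers on it (assuming a mean Earth radius of 6371 km):

R <- 6371                      # mean Earth radius in km (assumed)
deg_km <- pi * R / 180         # one degree of arc is about 111 km
deg_km * cos( 0 * pi / 180)    # one degree east at the equator: ~111 km
deg_km * cos(49 * pi / 180)    # one degree east near Vancouver (~49 deg N): ~73 km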
Even then, you need to pay attention to units. One implementation of Haversine may yield radians, another one meters, and of course someone crazy will work in miles.
Unfortunately, as far as I can tell, none of the R implementations can accelerate Haversine distance. So it may be much faster to cluster the data in ELKI instead (you need to add an index yourself though).
If your data is small enough, however, you can use a precomputed distance matrix (a dist object) in R. But that takes O(n²) time and memory, so it is not very scalable.
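A sketch of that precomputed-matrix route for the data frame above; note that distHaversine() returns metres, so eps is then expressed in metres rather than degrees (the 150 km threshold is purely illustrative):

library(geosphere)
library(dbscan)
m  <- distm(df[, c("lon", "lat")], fun = distHaversine)      # full matrix, metres
db <- dbscan::dbscan(as.dist(m), eps = 150000, minPts = 2)   # eps = 150 km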
Can I apply DBSCAN with other features in addition to location? And if so, how can it be done in R or Spark?
I tried preparing an R table of three columns: latitude, longitude, and score (the feature I want to cluster on in addition to the spatial ones). When I ran DBSCAN with the following R code, I got a plot suggesting that the algorithm builds clusters on each pair of columns: (long, lat), (long, score), (lat, score), and so on.
My R code:
df  <- read.table("/home/ahmedelgamal/Desktop/preparedData")
var <- dbscan(df, eps = 0.013)
plot(x = var, data = df)
and the plot I get (a matrix of pairwise scatterplots, not reproduced here).
You are misinterpreting the plot.
You don't get one result per plot, but all plots show the same clusters, only in different attributes.
But you also have the issue that the R version is (to my knowledge) only fast for Euclidean distance.
In your current code, points are neighbors if (lat[i]-lat[j])^2 + (lon[i]-lon[j])^2 + (score[i]-score[j])^2 <= eps^2. This is bad because: (1) latitude and longitude are not Euclidean, so you should be using Haversine instead; (2) your additional attribute has a much larger scale, so you pretty much only cluster points with near-zero score; and (3) your score attribute is skewed.
For this problem you should probably be using Generalized DBSCAN. Points are similar if their Haversine distance is less than, e.g., 1 mile (you want to measure geographic distance here, not coordinates, because of distortion) and if their score differs by a factor of at most 1.1 (i.e., compare score[y] / score[x], or work in log space). Since you want both conditions to hold, the usual Euclidean DBSCAN implementation is not enough; you need a Generalized DBSCAN that allows multiple conditions. Look for an implementation of Generalized DBSCAN (I believe there is one in ELKI that you may be able to access from Spark), or implement it yourself; it's not very hard to do.
If quadratic runtime is okay for you, you can probably use any distance-matrix-based DBSCAN and simply "hack" a binary distance matrix, as in the sketch after this list:
1. Compute the Haversine distances.
2. Compute the score dissimilarities.
3. Set distance = 0 if haversine < distance-threshold and score-dissimilarity < score-threshold, and 1 otherwise.
4. Run DBSCAN with the precomputed distance matrix and eps = 0.5 (since it is a binary matrix, don't change eps!).
It's reasonably fast, but needs O(n²) memory. In my experience, the indexes of ELKI yield a good speedup on larger data and are worth a try if you run out of memory or time.
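A sketch of that hack with hypothetical thresholds (1 mile ≈ 1609 m geographically, and scores differing by a factor of at most 1.1, compared in log space); df is assumed to hold lon, lat and score columns:

library(geosphere)
geo <- distm(df[, c("lon", "lat")], fun = distHaversine)  # metres
sco <- as.matrix(dist(log(df$score)))                     # |log score ratio|
bin <- 1 - (geo < 1609 & sco < log(1.1))   # 0 = both conditions hold, 1 = otherwise
db  <- dbscan::dbscan(as.dist(bin), eps = 0.5, minPts = 5)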
You need to scale your data. V3 has a much larger range than V1 and V2, so it dominates the distance computation and the spatial variables are mostly ignored.
In R you can use all sorts of metrics to build a distance matrix prior to clustering, e.g. binary distance, Manhattan distance, etc...
However, when it comes to choosing a linkage method (complete, average, single, etc.), these linkage methods all use Euclidean distance. This does not seem particularly appropriate if you rely on a different metric to build the distance matrix.
Is there a way (or a library...) to apply other distances to linkage methods when building a clustering tree?
Thanks!
I don't really get your question. For example, suppose I have the following data:
x <- matrix(rnorm(100), nrow=5)
then I can build a distance matrix using dist
##Changing the distance measure
d_e = dist(x, method="euclidean")
d_m = dist(x, method="maximum")
I can then cluster however I want:
##Changing the clustering method
hclust(d_m, method="median")
If you have constructed a matrix that already represents the pairwise distances, use e.g.
hclust(as.dist(mx), method="single")
You might want to try using agnes, rather than hclust, and hand it a distance matrix. There's a nice tutorial on this here:
http://strata.uga.edu/software/pdf/clusterTutorial.pdf
From the tutorial, here's how you would generate and use a distance matrix for clustering:
library(vegan)    # provides distance functions such as vegdist()
library(cluster)  # provides agnes()
mydata.bray <- vegdist(mydata, method = "bray")  # Bray-Curtis (= Sørensen) distances among samples
mydata.bray.agnes <- agnes(mydata.bray)          # run the cluster analysis
I myself use Daniel Müllner's fastcluster library, which is a drop-in replacement for hclust and is orders of magnitude faster for large data sets.
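A sketch of that drop-in replacement, reusing the Bray-Curtis distances from above:

library(fastcluster)  # masks stats::hclust with a much faster implementation
hc <- hclust(mydata.bray, method = "average")  # same call signature as stats::hclust
plot(hc)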