I am using the geoR package for spatial interpolation of rainfall. I should say that I am quite new to geostatistics. Thanks to some video tutorials on YouTube, I understand (well, I think so) the theory behind the variogram. As I understand it, the number of pairs should decrease with increasing lag distance. For example, if we consider a 100 m long stretch (say, a 100 m long cross section of a river bed), the number of pairs at a 5 m lag is 20, the number of pairs at a 10 m lag is 10, and so on. But I am confused by the output of the variog function in the geoR package. An example is given below:
mydata
X Y a
[1,] 415720 432795 2.551415
[2,] 415513 432834 2.553177
[3,] 415325 432740 2.824652
[4,] 415356 432847 2.751844
[5,] 415374 432858 2.194091
[6,] 415426 432774 2.598897
[7,] 415395 432811 2.699066
[8,] 415626 432762 2.916368
This is my dataset, where a is my variable (rainfall intensity) and X, Y are the coordinates of the points. The variogram calculation is shown below:
geodata=as.geodata(mydata)
variogram=variog(geodata)   # coords and data default to geodata$coords and geodata$data
variogram[1:3]
$u
[1] 46.01662 107.37212 138.04987 199.40537 291.43861 352.79411
$v
[1] 0.044636453 0.025991469 0.109742986 0.029081575 0.006289056 0.041963076
$n
[1] 3 8 3 3 3 2
where
u: a vector with distances.
v: a vector with estimated variogram values at distances given in u.
n: number of pairs in each bin
According to this, the number of pairs (n) follows no obvious pattern, whereas the corresponding lag distance (u) is increasing. I find this hard to understand. Can anyone explain what is happening? Also, any suggestions or advice on improving the variogram calculation for this application (spatial interpolation of rainfall intensity) would be highly appreciated, as I am new to geostatistics. Thanks in advance.
On a linear transect of 100 m with observations at a regular 5 m spacing, if you have 20 pairs at the 5 m lag, you have 19 pairs at the 10 m lag. This reasoning does not carry over to your data, because your points are irregularly distributed, and they are distributed over two dimensions. For irregularly distributed data you often have very few point pairs at the shortest distances. The advice for obtaining a better-looking variogram is to work with a larger data set: geostatistics starts getting interesting with 30 observations, and fun with over 100 observations.
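To see where those pair counts come from, here is a minimal sketch (assuming geodata is the object built from mydata above) that counts how many pairwise distances fall into each distance bin; the exact bins variog() uses depend on its defaults, so the counts are illustrative rather than identical:
# all pairwise distances between the 8 stations (28 pairs in total)
d <- dist(geodata$coords)
# count how many pairs fall into each of a handful of distance bins
table(cut(d, breaks = seq(0, max(d), length.out = 7)))
# with irregularly spaced 2-D points the counts per bin are uneven,
# which is why n jumps around while u increases monotonically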
I've used mclust to find clusters in a dataset. Now I want to implement these findings in external non-R software (predict.Mclust is thus not an option, as has been suggested in previous similar questions) to classify new observations. I need to know how mclust classifies observations.
Since mclust outputs a center and a covariance matrix for each cluster, it seemed reasonable to calculate the Mahalanobis distance of every observation to every cluster and then classify each observation to the Mahalanobis-nearest cluster. However, this does not seem to work fully.
Example code with simulated data (in this example I only use one dataset, d, and try to obtain the same classification that mclust produces via the Mahalanobis approach outlined above):
library(MASS)     # mvrnorm
library(mclust)   # Mclust
set.seed(123)
c1<-mvrnorm(100,mu=c(0,0),Sigma=matrix(c(2,0,0,2),ncol=2))
c2<-mvrnorm(200,mu=c(3,3),Sigma=matrix(c(3,0,0,3),ncol=2))
d<-rbind(c1,c2)
m<-Mclust(d)
int_class<-m$classification
clust1_cov<-m$parameters$variance$sigma[,,1]
clust1_center<-m$parameters$mean[,1]
clust2_cov<-m$parameters$variance$sigma[,,2]
clust2_center<-m$parameters$mean[,2]
mahal_clust1<-mahalanobis(d,cov=clust1_cov,center=clust1_center)
mahal_clust2<-mahalanobis(d,cov=clust2_cov,center=clust2_center)
mahal_clust_dist<-cbind(mahal_clust1,mahal_clust2)
mahal_classification<-apply(mahal_clust_dist,1,function(x){
match(min(x),x)
})
table(int_class,mahal_classification)
#List Mahalanobis distances for misclassified observations:
mahal_clust_dist[mahal_classification!=int_class,]
plot(m,what="classification")
#Indicate misclassified observations:
points(d[mahal_classification!=int_class,],pch="X")
#Results:
> table(int_class,mahal_classification)
mahal_classification
int_class 1 2
1 124 0
2 5 171
> mahal_clust_dist[mahal_classification!=int_class,]
mahal_clust1 mahal_clust2
[1,] 1.340450 1.978224
[2,] 1.607045 1.717490
[3,] 3.545037 3.938316
[4,] 4.647557 5.081306
[5,] 1.570491 2.193004
Five observations are classified differently by the Mahalanobis approach and by mclust. In the plot they are intermediate points between the two clusters. Could someone tell me why this does not work and how I can mimic the internal classification of mclust and predict.Mclust?
After formulating the above question I did some additional research (thanks LoBu) and found that the key is to calculate the posterior probability (pp) of an observation belonging to a certain cluster and to classify according to the maximal pp. The following works:
library(mvtnorm)   # for dmvnorm()
denom<-rep(0,nrow(d))
pp_matrix<-matrix(rep(NA,nrow(d)*2),nrow=nrow(d))
for(i in 1:2){
denom<-denom+m$parameters$pro[i]*dmvnorm(d,m$parameters$mean[,i],m$parameters$variance$sigma[,,i])
}
for(i in 1:2){
pp_matrix[,i]<-m$parameters$pro[i]*dmvnorm(d,m$parameters$mean[,i],m$parameters$variance$sigma[,,i]) / denom
}
pp_class<-apply(pp_matrix,1,function(x){
match(max(x),x)
})
table(pp_class,m$classification)
#Result:
pp_class 1 2
1 124 0
2 0 176
But if someone could explain in layman's terms the difference between the Mahalanobis and the pp approach, I would be grateful. What do the "mixing probabilities" (m$parameters$pro) signify?
In addition to the Mahalanobis distance, you also need to take the cluster weights into account.
These weights control the relative importance of the clusters where they overlap.
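To make that concrete, here is a sketch using the objects from the question (m, d, int_class, clust1_cov, clust2_cov, mahal_clust1, mahal_clust2). For a Gaussian mixture the classification score of cluster k is log(pro_k) - 0.5*log|2*pi*Sigma_k| - 0.5*D2_k, where D2_k is the squared Mahalanobis distance that mahalanobis() returns; adding the weight and the determinant term to the plain Mahalanobis comparison should reproduce mclust's classification:
# log of pro_k * N(x; mu_k, Sigma_k): weight + covariance volume + distance
score <- function(mahal2, Sigma, pro) {
  log(pro) - 0.5 * determinant(2 * pi * Sigma, logarithm = TRUE)$modulus - 0.5 * mahal2
}
s1 <- score(mahal_clust1, clust1_cov, m$parameters$pro[1])
s2 <- score(mahal_clust2, clust2_cov, m$parameters$pro[2])
weighted_class <- ifelse(s1 > s2, 1, 2)
table(weighted_class, int_class)   # should now agree with mclust's classification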
I needed to logarithmically distribute a number in the range 2-200 into a given number of intervals. The formula used in another question worked perfectly; I just couldn't explain how or why it works.
Can anyone explain how this function works? Perhaps derive it for me?
It's projecting evenly spaced values onto logarithmically spaced values by raising some number X to the power of each value.
The example distributes the range 1-1000 into 10 intervals. The end result is that some number X raised to the 10th power must be 1,000 (X^10 = 1,000), so X is 1,000^(1/10) = 1.99526.
Now the range
1 2 3 4 5 6 7 8 9 10
is projected to a logarithmic range using the function
f(x) = 1.99526 ^ x
which results in
1.99526, 3.98107, 7.94328, 15.8489, 31.6228, 63.0957, 125.893, 251.189, 501.187, 1000
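For reference, a short way to reproduce this in R, both for the 1-1000 example and (hypothetically) for the 2-200 range mentioned in the question:
n <- 10
x <- 1000^(1/n)          # the base, about 1.99526
x^(1:n)                  # the 10 logarithmically spaced values, ending at 1000

# general form for any range [lo, hi]:
lo <- 2; hi <- 200
lo * (hi/lo)^((1:n)/n)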
I am trying to cluster multidimensional functional data with the "kmeans" algorithm. What this means: I no longer have a single vector per row (individual); instead I have a 3x3 observation matrix per individual. For example, individual 1 has the following observations:
(x1, x2, x3), (y1, y2, y3), (z1, z2, z3).
The same structure of observations is given for the other individuals. Do you know how I can cluster with "kmeans" using all three observation vectors, and not only one observation vector as "kmeans" is normally used?
Would you cluster each observation vector, e.g. (x1, x2, x3), separately and then somehow combine the information? I want to do this with the kmeans() function in R.
Many thanks for your answers!
With k-means you interpret each observation as a point in an N-dimensional vector space, and then minimize the distances between your observations and the cluster centers.
Since the data are viewed as points in an N-dimensional space, the actual arrangement of the values does not matter.
You can therefore either tell your k-means routine to use a matrix norm, for example the Frobenius norm, to compute the distances, or flatten your observations from 3-by-3 matrices to 1-by-9 vectors. The Frobenius norm of an NxN matrix is equal to the Euclidean norm of the corresponding 1xN^2 vector.
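A minimal sketch of the flattening approach (the data below are simulated, since no real data set was given): each individual's 3x3 matrix becomes one row of length 9, and the Euclidean distance kmeans() uses on those rows is exactly the Frobenius distance between the original matrices.
set.seed(1)
# 20 hypothetical individuals, each with a 3x3 observation matrix
obs <- replicate(20, matrix(rnorm(9), nrow = 3), simplify = FALSE)
# flatten: one row of length 9 per individual
flat <- t(sapply(obs, as.vector))
km <- kmeans(flat, centers = 3)
km$cluster   # one cluster label per individual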
Just pass kmeans() all three columns and it will calculate the distances in three dimensions, if that is what you are looking for.
Suppose I have a trained SOM: mySom.
I want to test its quality. An interesting paper suggests using summary(mySom). Doing that gives:
som map of size 5x5 with a hexagonal topology.
Training data included; dimension is 1017 by 24
Mean distance to the closest unit in the map: 0.02276404
So, mean(mySom$distances) = 0.02276404 seems to be the mean distance of all the elements from their closest prototype.
Nevertheless, another measure should represent the same value: mySom$changes. Printing those values we find:
> mySom$changes
[,1]
[1,] 0.0053652766
[2,] 0.0054470742
[3,] 0.0054121733
[4,] 0.0054452036
...
[97,] 0.0010324613
[98,] 0.0009807617
[99,] 0.0010183714
[100,] 0.0010220923
After having presented the inputs to the SOM 100 times, we end up with a mean distance of every unit from the nearest one of 0.0010220923.
Problem: mySom$changes[100] != mean(mySom$distances). Why?
The first quality measure you describe, "Mean distance to the closest unit in the map", is the quantization error of the SOM, see http://www.ifs.tuwien.ac.at/~poelzlbauer/publications/Poe04WDA.pdf. It is computed by determining the average distance of the sample vectors to the cluster centroids by which they are represented; in the case of the SOM, the cluster centroids are the prototype vectors. This value is computed after the training process.
The second one seems to be computed per iteration. In a SOM, two values vary with the iterations: the learning rate and the neighbourhood distance (see this for a summary of SOM features). Try relating the 100 values to the initial and final values of the SOM parameters.
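As a sketch (assuming mySom was built with kohonen::som, which is what the summary() output above suggests), the two quantities can be inspected like this; the names are taken from the question:
library(kohonen)

# quantization error: mean distance of each sample to its best-matching unit,
# i.e. the value reported by summary(mySom)
mean(mySom$distances)

# $changes tracks how far the codebook vectors moved at each training iteration,
# so its last entry measures training convergence, not the fit to the data
plot(mySom, type = "changes")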
Say I have data concerning the positions of animals on a 2D plane (as determined by video monitoring from a camera directly overhead), for example a matrix with 15 rows (1 for each animal) and 2 columns (x position and y position):
animal.ids<-letters[1:15]
xpos<-runif(15) # x coordinates
ypos<-runif(15) # y coordinates
raw.data.t1<-data.frame(xpos, ypos)
rownames(raw.data.t1) = animal.ids
I want to calculate all the pairwise distances between animals: that is, get the distance from animal a (row 1) to the animals in row 2, row 3, ..., row 15, and then repeat that for all rows, avoiding redundant distance calculations. The desired output of a function that does this would be the mean of all the pairwise distances. I should clarify that I mean the simple linear distance, from the formula d <- sqrt(((x1-x2)^2)+((y1-y2)^2)). Any help would be greatly appreciated.
Furthermore, how could this be extended to a similar matrix with an arbitrarily large even number of columns (every two columns representing the x and y positions at a given time point)? The goal here would be to calculate the mean pairwise distance for every two columns and output a table with each time point and its corresponding mean pairwise distance. Here is an example of the data structure with 3 time points:
xpos1<-runif(15)
ypos1<-runif(15)
xpos2<-runif(15)
ypos2<-runif(15)
xpos3<-runif(15)
ypos3<-runif(15)
pos.data<-cbind(xpos1, ypos1, xpos2, ypos2, xpos3, ypos3)
rownames(pos.data) = letters[1:15]
The aptly named dist() will do this:
x <- matrix(rnorm(100), nrow=5)
dist(x)
1 2 3 4
2 7.734978
3 7.823720 5.376545
4 8.665365 5.429437 5.971924
5 7.105536 5.922752 5.134960 6.677726
See ?dist for more details
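For the multi-time-point part of the question, here is a sketch built on the pos.data matrix defined above, where every two columns are the (x, y) positions at one time point:
n.times <- ncol(pos.data) / 2
mean.pairwise <- sapply(seq_len(n.times), function(t) {
  xy <- pos.data[, c(2 * t - 1, 2 * t)]   # columns for time point t
  mean(dist(xy))                          # mean of all pairwise Euclidean distances
})
data.frame(time = seq_len(n.times), mean.pairwise)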
Why compute d <- sqrt(((x1-x2)^2)+((y1-y2)^2))?
Work with the squared distance, d2 <- ((x1-x2)^2)+((y1-y2)^2), instead; skipping the square root costs much less.
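As a small illustration (using the raw.data.t1 data frame from the question): squared distances preserve the ordering of true distances because the square root is monotone, so the root can be skipped whenever you only rank points or look for a nearest neighbour. Note that the mean pairwise distance asked for above does still need the square root.
xy <- as.matrix(raw.data.t1)
# squared distances from animal "a" (row 1) to everyone else, no sqrt involved
d2 <- rowSums(sweep(xy, 2, xy[1, ])^2)
names(which.min(d2[-1]))   # nearest neighbour of "a"; same answer as with sqrt(d2)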