I have a data set corresponding to observations of a real-valued function of two variables, that is (z,x,y), where z=f(x,y).
I need to compute the cross derivative of f at the available data points, that is the mixed second partial derivative d²f/dxdy.
The function gradient from the pracma package offers a solution for this question but only when the observed points (x,y) come from a regular grid.
Is there any code available to do that?
Best regards
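One possible approach for scattered points, offered as a sketch rather than an established routine: around each data point, fit a local quadratic surface to its k nearest neighbours by least squares and read off the coefficient of the dx*dy term, which estimates the cross derivative at that point. The function name cross_derivative and the neighbourhood size k are assumptions made for this illustration; they are not part of pracma. The estimate is exact only for quadratic f and will be noisy if the data are noisy or k is too small.

```r
# Estimate d2f/dxdy at scattered points via a local quadratic least-squares fit.
# cross_derivative() and k are hypothetical names chosen for this sketch.
cross_derivative <- function(x, y, z, k = 12) {
  stopifnot(length(x) == length(y), length(y) == length(z),
            k >= 6, k <= length(x))
  vapply(seq_along(x), function(i) {
    dx <- x - x[i]                          # coordinates centred on point i
    dy <- y - y[i]
    nb <- order(dx^2 + dy^2)[seq_len(k)]    # k nearest neighbours (includes i)
    design <- cbind(1, dx[nb], dy[nb], dx[nb]^2, dx[nb] * dy[nb], dy[nb]^2)
    coefs <- lm.fit(design, z[nb])$coefficients
    unname(coefs[5])                        # coefficient of dx*dy
  }, numeric(1))
}

# Toy check on a function whose cross derivative is 1 everywhere
set.seed(1)
x <- runif(200)
y <- runif(200)
z <- x * y + x^2
d2 <- cross_derivative(x, y, z)
range(d2)   # both ends should be very close to 1
```

The neighbour search here is a brute-force sort for every point, so for large data sets a dedicated nearest-neighbour routine would be a better choice.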
Related
Could you fully explain the difference between using the distm function and the distVincentyEllipsoid function to calculate the distance between geographic coordinates in R?
I noticed that the calculation takes much longer with distm. Beyond the difference itself, could you please explain why this happens?
Thank you!
Following on from your previous question here: Distance calculation optimization in R
The speed relates to the amount of computation required to produce the returned object, not necessarily to a difference in how the distances themselves are computed (I am not sure which great circle computation the distm() function uses as its default). Indeed, the geosphere documentation here: https://cran.r-project.org/web/packages/geosphere/geosphere.pdf suggests that the distVincentyEllipsoid() calculation is "very accurate" but "computationally more intensive" than other great circle methods. While this would make you suspect a slower computation, my code is faster because of the way I have structured it: it returns a vector of distances between consecutive rows (not a matrix of distances between each and every point).
Conversely, the distm() calculation in your original code returns a matrix of distances between each and every pair of points. For your problem this is not necessary so long as the data is ordered, which is why I computed a vector instead. Additionally, using hierarchical clustering to group the points into 3 (your defined number) clusters based on these distances is also not necessary, as we can use percentiles of the distances between consecutive points to do the same. Again, the speed benefit comes from computing the clusters on a single vector rather than on a matrix.
Please note, I am a data analyst with a background in accounting/finance and not a GIS specialist by any means. That being said, my use of the distVincentyEllipsoid() function comes from my general understanding that it returns a pretty accurate estimate of great circle distances as a vector (as opposed to a matrix). Moreover, having used this in the past to optimise logistics operations for pricing purposes, I can attest to the fact that these computations have been tested in the market and found to be sound.
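To make the size difference concrete, here is a small sketch with four made-up coordinates (roughly London, Paris, Berlin and Rome; they are not from the original question). The distance function is passed to distm() explicitly so that the comparison does not depend on whatever its default is. The matrix holds n x n entries, while the consecutive-row vector holds only n - 1.

```r
library(geosphere)

# Hypothetical points as (longitude, latitude), in the order they are visited
pts <- cbind(lon = c(-0.1276, 2.3522, 13.4050, 12.4964),
             lat = c(51.5072, 48.8566, 52.5200, 41.9028))

# All pairwise distances: a 4 x 4 matrix, i.e. n^2 distance evaluations
d_matrix <- distm(pts, fun = distVincentyEllipsoid)

# Distances between consecutive rows only: a vector of length n - 1
d_consecutive <- distVincentyEllipsoid(pts[-nrow(pts), ], pts[-1, ])

dim(d_matrix)
length(d_consecutive)
```

The consecutive distances are the entries just above the diagonal of the full matrix, so nothing is lost when only neighbouring rows matter.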
I have a question regarding k-means clustering. We have a dataset with 120,000 observations and need to compute a k-means cluster solution with R. The problem is that k-means usually uses Euclidean distance. Our dataset consists of 3 continuous variables, 11 ordinal (Likert 0-5) variables (I think it would be okay to handle them like continuous ones) and 5 binary variables. Do you have any suggestion for a distance measure that we can use for our k-means approach, given the "large" dataset? We are sticking with k-means, so I really hope one of you has a good idea.
Cheers,
Martin
One approach would be to normalize the features and then just use the 19-dimensional Euclidean distance (3 continuous + 11 ordinal + 5 binary variables). Cast the binary values to 0/1 (well, it's R, so it does that anyway) and go from there.
I don't see an immediate problem with this method, other than that k-means in 19 dimensions will definitely be hard to interpret. You could try a dimensionality reduction technique and hopefully make the k-means output easier to read, but you know way more about the data set than we ever could, so our ability to help you is limited.
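A minimal sketch of the normalize-then-cluster idea, using only base R. The data frame below is simulated and its column names are invented for the example; the real data would have 19 columns. The choice of 3 centers is arbitrary here.

```r
set.seed(42)
n <- 1000

# Simulated stand-in for the mixed data (hypothetical columns)
dat <- data.frame(
  income = rlnorm(n),                        # continuous
  age    = rnorm(n, mean = 40, sd = 10),     # continuous
  likert = sample(0:5, n, replace = TRUE),   # ordinal, treated as numeric
  flag   = rbinom(n, 1, 0.3)                 # binary, already 0/1
)

# Centre and scale every column so that no variable dominates the distance
X <- scale(dat)

# Several random starts reduce the dependence on the initial centers
fit <- kmeans(X, centers = 3, nstart = 10)
table(fit$cluster)
```

Whether the resulting clusters are meaningful for ordinal and binary columns is a separate question, as the other answers point out.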
You can certainly encode the binary variables as 0,1 too.
It is best practice in statistics not to treat Likert scale variables as numeric, because the spacing between the levels is not necessarily even.
But I don't think you will get meaningful k-means clusters. That algorithm is all about computing means. That makes sense on continuous variables. Discrete variables usually lack the "resolution" for this to work well. The mean then degrades to a "frequency", and then the data should be handled very differently.
Do not choose the problem by the hammer. Maybe your data is not a nail; and even if you'd like to make it work with k-means, it won't solve your problem... Instead, formulate your problem, then choose the right tool. So given your data, what is a good cluster? Until you have an equation that measures this, hammering the data won't solve anything.
Encoding the variables as binary will not solve the underlying problem. Rather, it will only increase the dimensionality of the data, which is an added burden. It is best practice in statistics not to convert the original data to another form, such as continuous to categorical or vice versa. If you do convert the data, the conversion must be in line with the question you are trying to answer, and you must provide a valid justification.
Continuing further, as others have stated, try to reduce the dimensionality of the dataset first. Check for issues such as missing values, outliers and zero-variance columns, and consider principal component analysis (for continuous variables) or correspondence analysis (for categorical variables). This can help you reduce the dimensionality. After all, data preprocessing is often said to make up 80% of an analysis.
Regarding the distance measure for mixed data types: the mean in k-means only makes sense for continuous variables. So I do not understand the logic of using the k-means algorithm for mixed data types.
Consider choosing another algorithm such as k-modes. k-modes is an extension of k-means. Instead of distances it uses dissimilarities (that is, a count of the total mismatches between two objects: the smaller this number, the more similar the two objects). And instead of means, it uses modes. A mode is a vector of elements that minimizes the dissimilarities between the vector itself and each object of the data.
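To illustrate the two ingredients just described, here is a small base R sketch with made-up nominal data. It only computes the mode of one group and the mismatch counts against it; it is not a full k-modes implementation. As far as I know, the klaR package provides a kmodes() function for the complete algorithm. The helper names simple_matching and column_mode are hypothetical.

```r
# Toy nominal data; column names and values are invented for illustration
animals <- data.frame(
  diet    = c("herbivore", "carnivore", "herbivore", "herbivore"),
  habitat = c("land", "land", "water", "land"),
  stringsAsFactors = FALSE
)

# Simple matching dissimilarity: number of attributes on which two objects differ
simple_matching <- function(a, b) sum(a != b)

# Most frequent value of one attribute (ties go to the first level in table())
column_mode <- function(v) names(which.max(table(v)))

# The mode of the group: the most frequent value in each column
cluster_mode <- vapply(animals, column_mode, character(1))

# Dissimilarity of every object to that mode
mismatches <- apply(animals, 1, simple_matching, b = cluster_mode)

cluster_mode
mismatches
```

In the full algorithm these two steps alternate: objects are assigned to the nearest mode, then each mode is recomputed from its assigned objects.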
Mixture models can be used to cluster mixed data.
You can use the R package VarSelLCM, which models, within each cluster, the continuous variables by Gaussian distributions and the ordinal/binary variables by multinomial distributions.
Moreover, missing values can be managed by the model at hand.
A tutorial is available at: http://varsellcm.r-forge.r-project.org/
I am working on using the k nearest neighbours for which a certain variable is known to determine the value of this same variable for an individual for which it is unknown (test). Two approaches are possible:
first (the easy one), calculate the mean value of the variable over the k neighbours; second (the better one), calculate a weighted mean, with weights according to the proximity of the neighbours.
My first approach has been to use the knn.index function in the FNN package to identify the nearest neighbours and then use the indexes to look up the values in the dataset and compute the mean. This was very slow, as the dataset is quite big. Is there any algorithm already implemented that does this calculation faster, and would it be possible to add weights according to distance?
After a week of trying to solve the problem, I found a function in R that solves my question. This might help others who have struggled with the same issue.
The function is named kknn, and it is in the package kknn. It lets you do a KNN regression while weighting the points by their distance.
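For anyone who wants a starting point, a minimal sketch of a distance-weighted KNN regression with kknn follows. The data are simulated and the column names are invented. The kernel "inv" (inverse distance weighting) and k = 10 are choices made for the example, not recommendations. The test data frame carries the response column as well, only so that predictions can be compared with known values.

```r
library(kknn)

set.seed(1)

# Simulated training data: y depends on two predictors (hypothetical names)
train <- data.frame(x1 = runif(200), x2 = runif(200))
train$y <- train$x1 + 2 * train$x2 + rnorm(200, sd = 0.05)

# Individuals whose value we want to predict
test <- data.frame(x1 = runif(5), x2 = runif(5))
test$y <- test$x1 + 2 * test$x2   # true values, kept only for comparison

# Weighted KNN regression: neighbours are weighted by inverse distance
fit <- kknn(y ~ x1 + x2, train = train, test = test,
            k = 10, distance = 2, kernel = "inv")

pred <- fit$fitted.values
cbind(predicted = pred, actual = test$y)
```

Setting kernel = "rectangular" gives every neighbour the same weight, which corresponds to the plain mean described in the question.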
I am attempting to cluster the behavioral traits of 250 species into life-history strategies. The trait data consists of both numerical and nominal variables. I am relatively new to R and to cluster analysis, but I believe the best option to find the distances for these points is to use the gower similarity method within the daisy function. 1) Is that the best method?
Once I have these distances, I would like to find significant clusters. I have looked into pvclust and like its ability to give me the strength of the cluster. However, I have not been able to modify the code to accept the distance measurements previously made using daisy. I have unsuccessfully tried to follow the advice given here https://stats.stackexchange.com/questions/10347/making-a-heatmap-with-a-precomputed-distance-matrix-and-data-matrix-in-r/10349#10349 and using the code obtained here http://www.is.titech.ac.jp/~shimo/prog/pvclust/pvclust_unofficial_090824/pvclust.R
2) Can anyone help me to modify the existing code to accept my distance measurements?
3) Or, is there another better way to determine the number of significant clusters?
I thank all in advance for your help.
Some comments...
About 1)
It is a good way to deal with different types of data.
You could also create as many new columns in the dataset as there are possible nominal values and put 1/0 where needed. For example, if there are 3 nominal values such as "reptile", "mammal" and "bird", you could replace your initial dataset that has 2 columns (numeric, nominal) with a new one with 4 columns (numeric, numeric (representing reptile), numeric (representing mammal), numeric (representing bird)). An instance (23.4,"mammal") would be mapped to (23.4,0,1,0).
Using this mapping you could work with "normal" distances (be sure to standardize the data so that no column dominates the others due to its big/small values).
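The mapping above can be sketched in base R with model.matrix(). The data frame and its column names are made up for the example; the factor levels are ordered so that the columns come out as reptile, mammal, bird.

```r
# Hypothetical two-column dataset: one numeric and one nominal variable
dat <- data.frame(
  mass  = c(23.4, 1.2, 0.3),
  class = factor(c("mammal", "reptile", "bird"),
                 levels = c("reptile", "mammal", "bird"))
)

# One 0/1 column per nominal value ("- 1" drops the intercept so that
# every level gets its own column)
onehot <- model.matrix(~ class - 1, data = dat)
X <- cbind(mass = dat$mass, onehot)
X[1, ]   # the (23.4, "mammal") instance

# Standardize the columns, then compute ordinary Euclidean distances
d <- dist(scale(X))
```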
About 2)
daisy returns an object of class dissimilarity, which you can use in other clustering algorithms from the cluster package (so maybe you don't have to implement anything more). For example, the function pam accepts the object returned by daisy directly.
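A short sketch of that workflow, with simulated trait data (the column names and values are invented). The average silhouette width is shown as one common, informal way to compare different numbers of clusters; it is not a significance test.

```r
library(cluster)

set.seed(7)

# Simulated mixed-type traits for 20 hypothetical species
traits <- data.frame(
  body_mass = c(rnorm(10, mean = 5), rnorm(10, mean = 50, sd = 5)),
  diet      = factor(rep(c("herbivore", "carnivore"), each = 10))
)

# Gower dissimilarities handle numeric and nominal columns together
d <- daisy(traits, metric = "gower")

# pam() accepts the dissimilarity object directly
fit <- pam(d, k = 2)
fit$clustering

# Average silhouette width for several candidate values of k
sil <- sapply(2:5, function(k) pam(d, k = k)$silinfo$avg.width)
names(sil) <- 2:5
sil
```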
About 3)
Clusters are really subjective and most cluster algorithms depend on the initial conditions, so "significant clusters" is a term that some people would not be comfortable using. Pam could be useful in your case because its clusters are centered on medoids, which is good for nominal data (because it is interpretable). K-means, for example, has the disadvantage that the centroids are not interpretable (what does 1/2 reptile, 1/2 mammal mean?). Pam builds clusters centered on actual instances, which is nice for interpretation purposes.
About pam:
http://en.wikipedia.org/wiki/K-medoids
http://stat.ethz.ch/R-manual/R-devel/library/cluster/html/pam.html
You can use Zahn's algorithm to find the clusters. Basically, it builds a minimum spanning tree and then removes the longest edges.
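In its simplest form (remove the k - 1 longest edges of the minimum spanning tree), this gives the same partition as single-linkage hierarchical clustering cut into k groups, so base R can do it without building the tree explicitly. Zahn's original method uses a more refined criterion for which edges to remove; the sketch below covers only the simple variant, on simulated points.

```r
set.seed(3)

# Two well-separated groups of simulated 2-d points
pts <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))

# Single linkage merges clusters in order of increasing MST edge length,
# so cutting the tree into k groups removes the k - 1 longest MST edges
tree <- hclust(dist(pts), method = "single")
clusters <- cutree(tree, k = 2)
table(clusters)
```

The same call works on a dissimilarity object from daisy, wrapped in as.dist() if needed.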
I am currently using the 'Akima' interp routine to do 2d linear interpolation. I'm trying to make the linear interpolation as good as I can by excluding the bad data points and any interpolated values that depend upon them. I don't want to do any spline fitting, just linear interpolation.
I can think of two ways to do this using the existing akima package;
by partitioning the 2d datasets into valid subsets that do not have missing data points, and then interpolating on each, and then merging the results.
or by setting the missing value to a nonsense value (-1.0 in my case), and then marking as NA any interpolated value that depends on it. Unfortunately, the indices of the interpolation nodes do not appear to be returned, so I'll have to find these nodes myself, in which case I should just write my own routine.
Each is a bit of a pain, and I'm sure there must be a better way or a package that does one of the above, as this must be a common problem that many have had.
Any recommendations for an alternative interpolation routine or a method to use akima interp are greatly appreciated.
Bob
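One way to get the second option without knowing the node indices, offered as a sketch under assumptions rather than a tested recipe: interpolate a 0/1 "bad point" indicator with the same x and y as the data. With linear interpolation over the same triangulation, the interpolated indicator is positive exactly where a bad point contributes with non-zero weight, so those output cells can be set to NA. The data below are simulated, and the tolerance 1e-6 is an arbitrary choice to absorb rounding.

```r
library(akima)

set.seed(11)
n <- 60
x <- runif(n)
y <- runif(n)
z <- x + y

# Pretend the observation nearest the centre is invalid
bad <- which.min((x - 0.5)^2 + (y - 0.5)^2)
z[bad] <- NA

is_bad <- is.na(z)
z_fill <- ifelse(is_bad, 0, z)   # placeholder value; it is masked out below

xo <- seq(0, 1, length.out = 40)
yo <- seq(0, 1, length.out = 40)

# Same x, y for both calls, so both use the same triangulation
val  <- interp(x, y, z_fill, xo = xo, yo = yo, linear = TRUE)
flag <- interp(x, y, as.numeric(is_bad), xo = xo, yo = yo, linear = TRUE)

# Blank every output cell that received any weight from a bad point
masked <- val$z
masked[!is.na(flag$z) & flag$z > 1e-6] <- NA
```

Cells outside the convex hull of the data are NA in both results anyway. Simply dropping the bad points before calling interp is easier, but then the gap is filled by interpolating across it, which is not what was asked for.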
Have you looked at the Amelia package?