Determine mode locations of the kernel density estimate of multimodal univariate data - r

If I have a density estimate and I plot it with a particular bandwidth, I visually determine that there are 7 local maxima. I would just like to know how to plot separate distributions for each of these maxima on the same plot.
Also, is it possible to know exactly where the maxima occur by running some code? I can make ball-park estimates from the plot, but is there an R function I can use to get the exact points? I would like to know the mean and variance of the 7 densities I have identified.
Specifically, I have the following:
plot(density(stamp, bw=0.0013,kernel = "gaussian"))

Determining which modes of a kernel density estimate are real is a matter of which bandwidth you choose. This is a complicated question, and I don't advise relying on a single bandwidth, as even different optimal rules of thumb can give you different answers. In general, the number of modes of a KDE is smaller than the number of modes of the underlying density in the oversmoothed case and larger in the undersmoothed case. There are many papers that cover this topic and give you options for assessing whether a mode is real. Silverman's mode test for Gaussian kernels, Friedman and Fisher's PRIM algorithm, Marron's SiZer, and Minnotte and Scott's mode tree are good places to start.
A naive thing you can do, given a single choice of KDE bandwidth, is to check the run lengths.
In fact, with the bandwidth you have chosen, I find 9 modes. Just compute the sign changes of the differenced series, and take the cumulative lengths of the runs to locate the turning points. Every other turning point will be a mode or an antimode, depending on which came first (you can check the sign of the run to determine this).
library(BSDA)  # provides the Stamp data set
dstamp <- density(Stamp$thickness, bw = 0.0013, kernel = "gaussian")
# positions where the sign of the first difference changes: modes and antimodes
chng <- cumsum(rle(sign(diff(dstamp$y)))$lengths)
plot(dstamp)
abline(v = dstamp$x[chng[seq(1, length(chng), 2)]])
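If you want the modes and antimodes separately, rather than taking every other change point, you can look at the sign of each run directly. A small sketch reusing dstamp from above (the end of a rising run is a local maximum, the end of a falling run a local minimum; the very last run touches the boundary and may need to be dropped):
runs <- rle(sign(diff(dstamp$y)))
chng <- cumsum(runs$lengths)
modes     <- dstamp$x[chng[runs$values ==  1]]  # turning points reached while rising
antimodes <- dstamp$x[chng[runs$values == -1]]  # turning points reached while falling
abline(v = modes, col = "red", lty = 2)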

Since I needed something that returns only the strongest modes, I created a dead-simple algorithm that lets you tune the sensitivity by tweaking the number of density samples (to decrease local noise) and by setting a minimum density threshold, proportional to the maximum density (to decrease global noise).
find_posterior_modes <- function(x, n.samples = 100, filter = 0.1) {
  d <- density(x, n = n.samples)
  # keep x[i] if it is a local maximum of the estimated density and its height
  # exceeds `filter` times the global maximum (a simple noise threshold)
  modes <- with(d, sapply(2:(n.samples - 1), function(i)
    if (y[i] > y[i - 1] && y[i] > y[i + 1] && y[i] > max(y) * filter) x[i]))
  unlist(modes)
}
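For example (made-up data, not the original poster's), on a clearly bimodal sample it returns the two dominant mode locations:
set.seed(1)
x <- c(rnorm(500, mean = 0), rnorm(500, mean = 5))
find_posterior_modes(x, n.samples = 200, filter = 0.1)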

I recently released the package ModEstM. It uses the same method as shayaa, with two features to suppress the less significant modes:
you can choose the bandwidth of the density estimation via the "adjust" parameter of the density function;
the modes are reported in decreasing order of the corresponding density.
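I won't reproduce the package's internals here, but a rough base-R sketch of those two ideas (bandwidth controlled through adjust, modes reported in decreasing order of density height) could look like this:
modes_by_height <- function(x, adjust = 1) {
  d <- density(x, adjust = adjust)
  i <- which(diff(sign(diff(d$y))) == -2) + 1   # interior local maxima of the KDE
  data.frame(mode = d$x[i], height = d$y[i])[order(d$y[i], decreasing = TRUE), ]
}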

Related

Input to fit a power-law to degree distribution of a network

I would like to use R to test whether the degree distribution of a network behaves like a power-law with scale-free property. Nonetheless, I've read different people doing this in many different ways, and one confusing point is the input one should use in the model.
Barabasi, for example, recommends fitting a power-law to the 'complementary cumulative distribution' of degrees (see Advanced Topic 3.B of chapter 4, figure 4.22). However, I've seen people fit a power-law to the degrees of the graph (obtained with igraph::degree(g)), and I've also seen others fit a power-law to the degree distribution, obtained via igraph::degree_distribution(g, cumulative = T).
As you can see in the reproducible example below, these options give very different results. Which one is correct? And how can I get the "complementary cumulative distribution of degrees" from a graph so I can fit a power-law?
library(igraph)
# create a graph
set.seed(202)
g <- static.power.law.game(500, 1000, exponent.out= 2.2, exponent.in = 2.2, loops = FALSE, multiple = T)
# get input to fit power-law.
# 1) degrees of the nodes
d <- degree(g, v = V(g), mode ="all")
d <- d[ d > 0] # remove nodes with no connection
# OR ?
# 2) cumulative degree distribution
d <- degree_distribution(g, mode ="all", cumulative = T)
# Fit power law
fit <- fit_power_law(d, implementation = "R.mle")
Well, the problem is that you have two different statistics here.
The degree of a node shows how many connections it has to other nodes.
The degree distribution is the probability distribution of those degrees over the network.
For me it doesn't make much sense to apply the igraph::fit_power_law on a degree distribution as the degree distribution is already a power law to a certain extent.
However, don't forget that the igraph::fit_power_law has more options than the implementation argument, which will result in different things, depending on what you're "feeding it".
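If you go with the raw degrees (option 1 in the question), a hedged sketch of the usual workflow is to hand the degree sequence itself to fit_power_law and let it estimate xmin; the plfit implementation shown here is just one choice, and R.mle works on the same input:
library(igraph)
set.seed(202)
g <- static.power.law.game(500, 1000, exponent.out = 2.2, exponent.in = 2.2, loops = FALSE, multiple = TRUE)
deg <- degree(g, mode = "all")
deg <- deg[deg > 0]                                 # drop isolated nodes
fit <- fit_power_law(deg, implementation = "plfit") # returns alpha, xmin and a KS statistic
fit$alpha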

Automatically find the scaling factor of the x-axis using LsqFit (or other method)?

I have the following data: a vector B and a vector R. The vector B is the "independent" variable. For this pair, I have two data sets: One is an experimental measurement of Bex, Rex and the other is a simulation produced by me Bsim, Rsim. The simulation does not have any "scale" for the x-axis (the B vector). Therefore when I am trying to fit my curve to the experiment, I have to find out a scaling parameter B0 "by eye", and with this number B0 I multiply the entire Bsim vector and simply plot(Bsim, Rsim, Bex, Rex).
I wanted to use the LsqFit package to make the procedure automatic and more accurate. However, I am having trouble understanding how I could use it to find the scaling of the independent variable.
My first thought was to just "invert" the roles of B and R. However, there are two issues that I think make matters worse: 1) the R curve/data is not monotonic, and 2) the experimental data are much more "dense" (they have more data points: my simulation has 120 points in total, while the experiments have some thousands).
Below I give an example of what I am trying to accomplish (of course, the answer need not use LsqFit). I also attach two figures that demonstrate everything very clearly.
#= stuff happened before this point =#
Bsim, Rsim = load(simulation)
Bex, Rex = load(experiment)
#this is what I want to do:
some_model(x, p) = ???
fit = curve_fit(some_model, Bex, Rex, [3.5])
B0 = fit.param[1]
#this is what I currently do by trial and error:
B0 = 3.85
plot(B0*Bsim, Rsim, Bex, Rex)
P.S.: The R curves (dependent variables) are both normalized by their maximum value because their scaling is not important.
A simple approach, if you can always expect both your experiment and simulation to feature one high peak, and you're sure that there's only a scaling factor rather than also an offset, is to simply multiply your Bsim vector by mode_rex / mode_rsim (e.g. in your example, mode_rsim = 1 and mode_rex = 4, so multiply Bsim by 4). But I'm sure you've thought of this already.
For a more general approach, one way is as follows:
add and load Interpolations package
Create a grid to interpolate over, e.g. Grid = 0:0.01:Bex[end]
interpolate Rex over that grid, e.g.
RexInterp = interpolate( (Bex,), Rex, Gridded(Linear()));
RexGridVec = RexInterp[Grid];
interpolate Rsim over the same grid, but introduce your multiplier on the Bsim "knots", e.g.
Multiplier = 0.1;
RsimInterp = interpolate( (Multiplier * Bsim,), Rsim, Gridded(Linear()));
RsimGridVec = RsimInterp[Grid]
Now you can calculate a square error value between RsimGridVec and RexGridVec, e.g.
SqErr = sum((RsimGridVec - RexGridVec).^2)
If you follow this technique, then if you create a loop for a multiplier range (say 0:0.01:10), and get the square error associated with each multiplier, you can find out the multiplier for which the square error is the minimum.
In theory, if you wanted to find the optimal value of an offset too, you could make that the outer loop over a range of offsets. Mind you, this is a brute-force approach, but it should be reasonably efficient judging by the vectors in your graph.

A Feature Selection Algorithm POE1ACC for features with continuous value

I want to implement the "Probability of Error and Average Correlation Coefficient" algorithm (more info on page 143). It is an algorithm for selecting features from a set of features. As far as I know, this algorithm is not limited to boolean-valued features, but I don't know how to use it for continuous features.
This is the only example I could find of this algorithm:
Here, X is the feature to be predicted and C is any feature. To calculate the Probability of Error value of C, they select the values that mismatch the green pieces. Thus the PoE of C is (1-7/9) + (1-6/7) = 3/16 = 0.1875.
My question is thus: How can we use a continuous feature instead of a boolean feature, like in this example, to calculate PoE? Or is it not possible?
The algorithm that you describe is a feature selection algorithm, similar to the forward selection technique. At each step, we find a new feature Fi that minimizes this criterion:
weight_1 * ErrorProbability(Fi) + weight_2 * Acc(Fi)
Acc(Fi) represents the mean correlation between the feature Fi and the features already selected. You want to minimize this in order to keep your features uncorrelated, and thus have a well-conditioned problem.
ErrorProbability(Fi) represents how well the feature describes the variable you want to predict. For example, let's say you want to predict whether tomorrow will be rainy depending on temperature (a continuous feature).
The Bayes error rate (http://en.wikipedia.org/wiki/Bayes_error_rate) is:
P = Sum_{Ci} Integral_{x in Hi} P(x | Ci) * P(Ci) dx
In our example
Ci belong to {rainy ; not rainy}
x are instances of temperatures
Hi represents all temperatures that would lead to predicting a class other than Ci (so the integral accumulates the probability that Ci is the true class but is not predicted).
What is interesting is that you can take any predictor you like.
Now, suppose you have all temperatures in one vector and all states rainy/not rainy in another vector:
In order to have P(x|Rainy), consider the following values:
temperaturesWhenRainy <- temperatures[which(state=='rainy')]
What you should do next is plot a histogram of these values. Then try to fit a distribution to it. You will then have a parametric formula for P(x|Rainy).
If your distribution is Gaussian, you can do it simply:
m <- mean(temperaturesWhenRainy)
s <- sd(temperaturesWhenRainy)
Given some x value, you get the probability density P(x|Rainy):
p <- dnorm(x, mean = m, sd = s)
You can do the same procedure for P(x|Not Rainy). Then P(Rainy) and P(Not Rainy) are easy to compute.
Once you have all that stuff you can use the Bayes error rate formula, which yields your ErrorProbability for a continuous feature.
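To make that concrete, here is a minimal sketch under the Gaussian assumption above, with temperatures and state being the same (hypothetical) vectors as in the example; for two classes the Bayes error is the integral of the smaller of the two weighted class densities.
p_rainy <- mean(state == 'rainy')
m1 <- mean(temperatures[state == 'rainy']);  s1 <- sd(temperatures[state == 'rainy'])
m0 <- mean(temperatures[state != 'rainy']);  s0 <- sd(temperatures[state != 'rainy'])
bayes_error <- integrate(function(x)
    pmin(p_rainy * dnorm(x, m1, s1), (1 - p_rainy) * dnorm(x, m0, s0)),
  lower = -Inf, upper = Inf)$value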
Cheers

Choosing eps and minpts for DBSCAN (R)?

I've been searching for an answer for this question for quite a while, so I'm hoping someone can help me. I'm using dbscan from the fpc library in R. For example, I am looking at the USArrests data set and am using dbscan on it as follows:
library(fpc)
ds <- dbscan(USArrests,eps=20)
Choosing eps was merely by trial and error in this case. However, I am wondering if there is a function or code available to automate the choice of the best eps/minPts. I know some books recommend producing a plot of the sorted distances of each point to its kth nearest neighbour; that is, the x-axis represents "points sorted according to distance to kth nearest neighbour" and the y-axis represents the "kth nearest neighbour distance".
This type of plot is useful for helping to choose appropriate values for eps and minPts. I hope I have provided enough information for someone to help me out. I wanted to post a picture of what I meant, but I'm still a newbie so I can't post an image just yet.
There is no general way of choosing minPts. It depends on what you want to find. A low minPts means it will build more clusters from noise, so don't choose it too small.
For epsilon, there are various aspects. It again boils down to choosing whatever works on this data set and this minPts and this distance function and this normalization. You can try to do a knn distance histogram and choose a "knee" there, but there might be no visible one, or multiple.
OPTICS is a successor to DBSCAN that does not need the epsilon parameter (except for performance reasons with index support, see Wikipedia). It's much nicer, but I believe it is a pain to implement in R, because it needs advanced data structures (ideally, a data index tree for acceleration and an updatable heap for the priority queue), and R is all about matrix operations.
Naively, one can imagine OPTICS as doing all values of Epsilon at the same time, and putting the results in a cluster hierarchy.
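For what it's worth, the CRAN package dbscan does ship an OPTICS implementation these days; a minimal sketch, assuming that package and scaled input:
library(dbscan)
opt <- optics(scale(USArrests), minPts = 4)
plot(opt)                              # reachability plot; valleys correspond to clusters
res <- extractDBSCAN(opt, eps_cl = 1)  # flat DBSCAN-like clustering at one chosen epsilon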
The first thing you need to check however - pretty much independent of whatever clustering algorithm you are going to use - is to make sure you have a useful distance function and appropriate data normalization. If your distance degenerates, no clustering algorithm will work.
MinPts
As Anony-Mousse explained, 'A low minPts means it will build more clusters from noise, so don't choose it too small.'.
minPts is best set by a domain expert who understands the data well. Unfortunately, in many cases we don't have that domain knowledge, especially after the data has been normalized. One heuristic approach is to use ln(n), where n is the total number of points to be clustered.
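As a tiny illustration of that heuristic (x standing in for your data matrix):
minPts <- max(2, round(log(nrow(x))))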
epsilon
There are several ways to determine it:
1) k-distance plot
In a clustering with minPts = k, we expect the k-distances of core points and border points to lie within a certain range, while noise points can have a much greater k-distance, so we can observe a knee point in the k-distance plot. However, sometimes there may be no obvious knee, or there may be multiple knees, which makes it hard to decide.
2) DBSCAN extensions like OPTICS
OPTICS produces a hierarchical clustering, from which we can extract significant flat clusters by visual inspection; an OPTICS implementation is available in the Python module pyclustering. One of the original authors of DBSCAN and OPTICS also proposed an automatic way to extract flat clusters that requires no human intervention; for more information you can read this paper.
3) sensitivity analysis
Basically we want to choose a radius that is able to cluster more truly regular points (points that are similar to other points), while at the same time detecting more noise (outlier points). We can plot the percentage of regular points (points belonging to a cluster) versus epsilon: set different epsilon values on the x-axis and their corresponding percentages of regular points on the y-axis. Hopefully we can spot a segment where the percentage is especially sensitive to the epsilon value, and we choose the upper-bound epsilon of that segment as our parameter.
One common and popular way of managing the epsilon parameter of DBSCAN is to compute a k-distance plot of your dataset. Basically, you compute the k-nearest-neighbour distance for each data point, for different k, to understand the density distribution of your data. The k-NN distance is handy because it is a non-parametric method. Once you choose a minPts (which strongly depends on your data), you fix k to that value. Then you use as epsilon the k-distance corresponding to the area of the k-distance plot (for your fixed k) with a low slope.
For details on choosing parameters, see the paper below on p. 11:
Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017). DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Transactions on Database Systems (TODS), 42(3), 19.
For two-dimensional data: use default value of minPts=4 (Ester et al., 1996)
For more than 2 dimensions: minPts=2*dim (Sander et al., 1998)
Once you know which MinPts to choose, you can determine Epsilon:
Plot the k-distances with k=minPts (Ester et al., 1996)
Find the 'elbow' in the graph: the k-distance value at the elbow is your epsilon value.
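A hedged sketch of that recipe, assuming the dbscan package (whose kNNdistplot does the plotting) and the scaled USArrests data from the question; the eps value of 1.2 is only a placeholder for wherever the elbow actually appears:
library(dbscan)
X <- scale(USArrests)
minPts <- 2 * ncol(X)        # Sander et al. (1998) rule for more than 2 dimensions
kNNdistplot(X, k = minPts)   # look for the elbow in this curve
abline(h = 1.2, lty = 2)     # suppose the elbow sits near 1.2 ...
ds <- dbscan(X, eps = 1.2, minPts = minPts)  # ... then use that as eps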
If you have the resources, you can also test a bunch of epsilon and minPts values and see what works. I do this using expand.grid and mapply.
# Establish search parameters.
k <- c(25, 50, 100, 200, 500, 1000)
eps <- c(0.001, 0.01, 0.02, 0.05, 0.1, 0.2)
# Perform grid search (assumes dbscan::dbscan, whose minPts argument is used below,
# and a data matrix called `data`).
grid <- expand.grid(k = k, eps = eps)
results <- mapply(grid$k, grid$eps, FUN = function(k, eps) {
  cluster <- dbscan(data, minPts = k, eps = eps)$cluster
  counts <- table(cluster)  # cluster sizes, including the noise "cluster" 0
  cat(c("k =", k, "; eps =", eps, ";", counts, "\n"))
})
See this webpage, section 5: http://www.sthda.com/english/wiki/dbscan-density-based-clustering-for-discovering-clusters-in-large-datasets-with-noise-unsupervised-machine-learning
It gives detailed instructions on how to find epsilon. MinPts ... not so much.

numerical integration of a tricky function

The prob package numerically evaluates characteristic functions for base R distributions. For almost all distributions there are existing formulas. For a few cases, though, no closed-form solution is known. Case in point: the Weibull distribution (but see below).
For the Weibull characteristic function I essentially compute two integrals and put them together:
fr <- function(x) cos(t * x) * dweibull(x, shape, scale)
fi <- function(x) sin(t * x) * dweibull(x, shape, scale)
Rp <- integrate(fr, lower = 0, upper = Inf)$value
Ip <- integrate(fi, lower = 0, upper = Inf)$value
Rp + (0+1i) * Ip
Yes, it's clumsy, but it works surprisingly well! ...ahem, most of the time. A user reported recently that the following breaks:
cfweibull(56, shape = 0.5, scale = 1)
Error in integrate(fr, lower = 0, upper = Inf) :
the integral is probably divergent
Now, we know that the integral isn't divergent, so it must be a numerical problem. With some fiddling I could get the following to work:
fr <- function(x) cos(56 * x) * dweibull(x, 0.5, 1)
integrate(fr, lower = 0.00001, upper = Inf, subdivisions=1e7)$value
[1] 0.08024055
That's OK, but it isn't quite right, plus it takes a fair bit of fiddling that doesn't scale well. I've been investigating this for a better solution. I found a recently published "closed-form" for the characteristic function with shape > 1 (see here), but it involves Wright's generalized confluent hypergeometric function, which isn't implemented in R (yet). I looked into the archives for integrate alternatives, and there's a ton of stuff out there which doesn't seem very well organized.
As part of that searching it occurred to me to translate the region of integration to a finite interval via the inverse tangent, and voila! Check it out:
cfweibull3 <- function(t, shape, scale = 1) {
  if (shape <= 0 || scale <= 0)
    stop("shape and scale must be positive")
  # substitution: the original variable is tan(x), so x runs over [0, pi/2)
  # and 1/cos(x)^2 is the Jacobian
  fr <- function(x) cos(t * tan(x)) * dweibull(tan(x), shape, scale)/(cos(x))^2
  fi <- function(x) sin(t * tan(x)) * dweibull(tan(x), shape, scale)/(cos(x))^2
  Rp <- integrate(fr, lower = 0, upper = pi/2, stop.on.error = FALSE)$value
  Ip <- integrate(fi, lower = 0, upper = pi/2, stop.on.error = FALSE)$value
  Rp + (0+1i) * Ip
}
> cfweibull3(56, shape=0.5, scale = 1)
[1] 0.08297194+0.07528834i
Questions:
Can you do better than this?
Could people who are expert in numerical integration routines shed some light on what is happening here? I have a sneaking suspicion that for large t the cosine fluctuates rapidly, which causes problems...?
Are there existing R routines/packages which are better suited for this type of problem, and could somebody point me to a well-placed position (on the mountain) to start the climb?
Comments:
Yes, it is bad practice to use t as a function argument.
I calculated the exact answer for shape > 1 using the published result with Maple, and the brute-force-integrate-by-the-definition-with-R kicked Maple's ass. That is, I get the same answer (up to numerical precision) in a small fraction of a second and an even smaller fraction of the price.
Edit:
I was going to write down the exact integrals I'm looking for but it seems this particular site doesn't support MathJAX so I'll give links instead. I'm looking to numerically evaluate the characteristic function of the Weibull distribution for reasonable inputs t (whatever that means). The value is a complex number but we can split it into its real and imaginary parts and that's what I was calling Rp and Ip above.
One final comment: Wikipedia lists a formula (an infinite series) for the Weibull c.f., and it matches the one proved in the paper I referenced above; however, that series has only been proved to hold for shape > 1. The case 0 < shape < 1 is still an open problem; see the paper for details.
You may be interested in this paper, which discusses different integration methods for highly oscillatory integrals, which is essentially what you are trying to compute:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.8.6944
Also, another possible piece of advice is that instead of an infinite limit you may want to specify a finite one: if you specify the precision that you want, then based on the CDF of the Weibull you can easily estimate how much of the tail you can truncate. And if you have a fixed limit, then you can specify the number of subdivisions almost exactly (e.g. so as to have a few (4-8) points per period).
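A rough sketch of that truncation idea for the problematic case above (t = 56, shape = 0.5): pick the upper limit from the Weibull quantile function so that the ignored tail is negligible, then budget subdivisions according to the oscillation period 2*pi/t.
t <- 56; shape <- 0.5; scale <- 1
upper <- qweibull(1 - 1e-12, shape, scale)   # tail beyond this contributes less than 1e-12
fr <- function(x) cos(t * x) * dweibull(x, shape, scale)
subdiv <- 8 * ceiling(upper / (2 * pi / t))  # allow roughly 8 subintervals per period
integrate(fr, lower = 0, upper = upper, subdivisions = subdiv)$value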
I had the same problem as Jay, not with the Weibull distribution but with the integrate function. I found my answer to Jay's question 3 in a comment on this question:
Divergent Integral in R is solvable in Wolfram
The R package pracma contains several functions for evaluating integrals numerically. The package provides routines for integrating certain specific kinds of functions, as well as a more general function, integral. That helped in my case. Example code is given below.
Regarding question 2: the first answer to the linked question (above) states that R does not print the complete error message from the C source (the function may just converge too slowly). Therefore, I would agree with Jay that the rapid fluctuation of the cosine may be the problem. In my case and in the example below it was.
Example Code
# load Practical Numerical Math Functions package
library(pracma)
# define function
testfun <- function(r) cos(r*10^6)*exp(-r)
# Integrate it numerically with the basic 'integrate'.
out1 = integarte(testfun, 0, 100)
# "Error in integrate(testfun, 0, 100) : the integral is probably divergent"
# Integrate it numerically with 'integral' from 'pracma' package
# using 'Gauss-Kronrod' method and 10^-8 as relative tolerance. I
# did not try the other ones.
out2 = integral(testfun, 0, 100, method = 'Kronrod', reltol = 1e-8)
Two remarks
The integral function does not break as the integrate function does but it may take quite a long time to run. I do not know (and I did not try) whether the user can limit the number of iterations (?).
Even if the integral function finishes without errors, I am not sure how correct the result is. Numerically integrating a function that fluctuates rapidly around zero seems to be quite tricky, since one does not know exactly where on the fluctuating function the values are evaluated (e.g. twice as many positive as negative values; positive values close to local maxima and negative values far off). I am not an expert on numerical integration; I only know some basic fixed-step integration methods from my numerics lectures. So maybe the adaptive methods used in integral deal with this problem in some way.
I'm attempting to answer questions 1 & 3. That being said I am not contributing any original code. I did a google search and hopefully this is helpful. Good luck!
Source:http://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf (p.6)
#Script
library(ggplot2)
## sampling from a Weibull distribution with parameters shape=2.1 and scale=1.1
x.wei<-rweibull(n=200,shape=2.1,scale=1.1)
x.teo <- rweibull(n=200, shape=2, scale=1) ## theoretical quantiles from a Weibull population with known parameters shape=2 and scale=1
#Figure
qqplot(x.teo,x.wei,main="QQ-plot distr. Weibull") ## QQ-plot
abline(0,1) ## a 45-degree reference line is plotted
Is this of any use?
http://www.sciencedirect.com/science/article/pii/S0378383907000452
Muraleedharan et al. (2007), Modified Weibull distribution for maximum and significant wave height simulation and prediction, Coastal Engineering, Volume 54, Issue 8, August 2007, Pages 630-638.
From the abstract: "The characteristic function of the Weibull distribution is derived."
