Input to fit a power-law to degree distribution of a network - r

I would like to use R to test whether the degree distribution of a network behaves like a power law with the scale-free property. However, I've seen people do this in several different ways, and one confusing point is what input the model should be given.
Barabási, for example, recommends fitting a power law to the complementary cumulative distribution of the degrees (see Advanced Topic 3.B of chapter 4, figure 4.22). However, I've seen people fit a power law to the raw degrees of the graph (obtained with igraph::degree(g)), and I've also seen others fit a power law to the degree distribution obtained via igraph::degree_distribution(g, cumulative = T).
As you can see in the reproducible example below, these options give very different results. Which one is correct? And how can I get the complementary cumulative distribution of degrees from a graph so I can fit a power law to it?
library(igraph)
# create a graph
set.seed(202)
g <- static.power.law.game(500, 1000, exponent.out= 2.2, exponent.in = 2.2, loops = FALSE, multiple = T)
# get input to fit power-law.
# 1) degrees of the nodes
d <- degree(g, v = V(g), mode ="all")
d <- d[ d > 0] # remove nodes with no connection
# OR ?
# 2) cumulative degree distribution
d <- degree_distribution(g, mode ="all", cumulative = T)
# Fit power law
fit <- fit_power_law(d, implementation = "R.mle")

Well, the problem is that you are dealing with two different statistics.
The degree of a node is the number of connections it has to other nodes.
The degree distribution is the probability distribution of those degrees over the whole network.
To me it doesn't make much sense to apply igraph::fit_power_law to a degree distribution, because the degree distribution is already, to some extent, the power law you are trying to characterise; the function is meant to be given a vector of observed values, i.e. the degree sequence itself.
Also remember that igraph::fit_power_law has more arguments than implementation (xmin, for instance), and these will give different results depending on what you are feeding it.
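As a concrete illustration (a minimal sketch reusing the graph g built in the question), this fits the raw degree sequence; the default plfit implementation also estimates xmin, which often affects the result more than the choice of input vector:
library(igraph)
# fit the power law to the raw degrees, not to degree_distribution()
deg <- degree(g, mode = "all")
deg <- deg[deg > 0]                        # drop isolated vertices
fit <- fit_power_law(deg, implementation = "plfit")
fit$alpha   # estimated exponent
fit$xmin    # lower cut-off chosen by the Kolmogorov-Smirnov criterion
fit$KS.p    # goodness-of-fit p-value (plfit implementation only)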

Related

How to fit a von Mises distribution to my data for generating random samples

My data comprises 16 pairs of distances and bearings from a particular location.
I am trying to generate 1000 resamples of those 16 pairs (i.e. create new sets of X2 and Y2), so that in the end I will have 1000 sets of 16 distance-bearing pairs, each giving 16 new spatial points.
My data: the bearing and the distance are used to generate X2 and Y2.
What I have done so far is resample (reshuffle) from the 16 values I already have:
f2 <- function(x) data.frame(bearing  = sample(min(HRlog$beartoenc):max(HRlog$beartoenc), 16, replace = TRUE),
                             distance = sample(min(HRlog$distoenc):max(HRlog$distoenc), 16, replace = TRUE))
se1randcent <- as.data.frame(lapply(seq(1000), f2))
but that was a no-go with my advisors.
I have been told I should resample according to the von Mises distribution, i.e. fit the distribution to my data and then regenerate the 16 pairs from this distribution according to the kappa value I get.
I don't really know what this means. Can anyone help me figure it out?
I am posting this question because I am under serious time pressure and 9 months pregnant, so my pregnancy brain isn't helping me figure it out in a timely fashion.
Any help on this will be greatly appreciated !!
The package circular in R could be helpful. The von Mises kappa parameter can be estimated from the angles you provided either with a minimization method, as I show below, or with the built-in maximum likelihood estimator mle.vonmises(). Once you have the parameters, you can call rvonmises with the number of samples and the estimated parameters to generate the sample. The generated sample seems to live on [0, 2*pi], so some adjustment may be needed to make sure the mean values are correctly represented.
Fitting the distance would probably require a separate distribution, and the possible dependency between the two is not addressed in this answer.
library(circular) # circular statistics and bessel functions
# converting the bearing to be on the interval [-pi,pi] which is conventional for von Mises
bearing <- c(19.07,71.88,17.23,202.39,173.67,357.04,5.82,5.82,95.53,5.82,94.13,157.67,19.07,202.39,173.67,128.15)
bearing_rad <- bearing*2*pi/360 - pi
# sample statistics
circ_mean <- mean.circular(bearing_rad) # mu of von Mises
circ_sd <- sd.circular(bearing_rad) # related to kappa of von Mises
circ_var <- var.circular(bearing_rad)
# squared difference between the sample circular variance and the
# von Mises circular variance implied by kappa, i.e. 1 - A1(kappa)
diff_vars2 <- function(kappa){
  # squaring keeps the objective non-negative, with its minimum at the matching kappa
  return((1 - A1(kappa) - circ_var)^2)
}
# solving for kappa by matching the variances
kappa_solution <- optim(par = 1, fn = diff_vars2, lower = 0, method = "L-BFGS-B")
# sample from von mises distribution
sampled_vals <- rvonmises(n=100, mu=circ_mean, kappa=kappa_solution$par)
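Alternatively, the built-in maximum likelihood estimator mentioned above can be used instead of the variance-matching step; a brief sketch, reusing the same bearing_rad vector:
# maximum likelihood estimates of mu and kappa
mle_fit <- mle.vonmises(circular(bearing_rad))
sampled_vals_mle <- rvonmises(n = 100, mu = mle_fit$mu, kappa = mle_fit$kappa)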
Added content based on comments
One problem with tests for uniformity is that you have a small sample size. Two methods that seem appropriate are the Rayleigh and Kuiper tests, which test against uniformity. Background on them is given in the NCSS manual.
Both are implemented in circular, but I am not sure whether the modified Rayleigh test is used. The results for bearing_rad are a Rayleigh p-value of 0.2 and a Kuiper p-value < 0.05.
rayleigh.test(x=bearing_rad)
kuiper.test(x=bearing_rad)
You can add the fitted histogram to the above plot by using dvonmises. This will give the radius, which can be converted to x and y using the standard polar coordinate translation. Making the angles work can be a bit tricky. If you don't want the rose diagram in the background you can use plot.
rose.diag(bearing_rad)
density_vals <- dvonmises(x=seq(0,2*pi,0.01)-circ_mean,mu = 0,kappa=kappa_solution$par)
x_from_polar <- density_vals*cos(seq(0,2*pi,0.01))
y_from_polar <- density_vals*sin(seq(0,2*pi,0.01))
lines(x=x_from_polar,y=y_from_polar,col='red')

Generate multivariate nonnormal random numbers in R

Background
I want to generate multivariate random numbers with a fixed covariance matrix. For example, I want to generate 2-dimensional data with covariance 0.5 and variance 1 in each dimension. The first marginal is a normal distribution with mean = 0, sd = 1, and the second is an exponential distribution with rate = 2.
My attempt
My attempt is to generate correlated multivariate normal random numbers and then transform them to any distribution by inverse transform sampling.
Below I give an example that transforms 2-dimensional normal random numbers into a norm(0, 1) + exp(2) pair:
library(MASS)
# generate a correlated bivariate normal sample; data[,1] and data[,2] are standard normal
data <- mvrnorm(n = 1000, mu = c(0, 0), Sigma = matrix(c(1, 0.5, 0.5, 1), 2, 2))
# empirical cdf of dimension 2
exp_cdf <- ecdf(data[, 2])
Fn <- exp_cdf(data[, 2])
# inverse transform sampling to get an exponential distribution with rate = 2
# (the small offset avoids log(0) when Fn equals 1)
x <- -log(1 - Fn + 10^(-5)) / 2
mean(x); cor(data[, 1], x)
Out:
[1] 0.5035326
[1] 0.436236
From the output, the new x is a set of exponential(rate = 2) random numbers, and x and data[,1] have a correlation of about 0.44, which is not very close to my target value of 0.5. That may be an issue: I think the covariance of the generated sample should stay closer to the value I set. In general I don't think my method is very elegant; maybe you have better code snippets.
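A cleaner variant of the same idea (just a sketch, not a settled solution) is to push the normal margin through its exact CDF rather than the empirical one, i.e. a Gaussian-copula style construction:
library(MASS)
set.seed(1)
# correlated standard normals
z <- mvrnorm(n = 1000, mu = c(0, 0), Sigma = matrix(c(1, 0.5, 0.5, 1), 2, 2))
# second margin: normal -> uniform via its exact CDF, then uniform -> exponential(rate = 2)
u  <- pnorm(z[, 2])
x2 <- qexp(u, rate = 2)
x1 <- z[, 1]                 # first margin stays standard normal
cor(x1, x2)                  # rank dependence is preserved; the Pearson correlation shrinks a little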
My question
As a statistics graduate, I know there are theoretically 10+ methods for generating multivariate random numbers. In this post I want to collect a bunch of code snippets that do it using packages or hand-written code. I will then compare them on different aspects, such as running time and the quality of the generated data. Any ideas are appreciated!
Note
Some users think I am asking for a package recommendation, but I am not. I already know the commonly used statistical theorems and R packages. I just want to know how to generate multivariate random numbers with a fixed covariance matrix in a clean way, together with a code example that generates the normal + exponential case. There must be more powerful code snippets for doing this well, which is why I am asking for help.
Sources:
Generating correlated random variables (Mathematics Stack Exchange)
Use copulas to generate multivariate random numbers (Stack Overflow)
Ross, Simulation (theoretical book)
CRAN Distributions task view

the meaning of cluster size in Cox process models in spatstat

For some tree wood, the conduits in cross sections clearly aggregate into clusters. It seems natural that a Cox process model in spatstat (R) could be fitted to the conduit point data, and the results include an estimated "Mean cluster size". I am not sure what this index means: can I interpret it as the mean number of conduits per cluster over the whole conduit point pattern?
The code, from a good example in the book, is as follows:
> fitM <- kppm(redwood ~ 1, "MatClust")
> fitM
# ...
# Scale: 0.08654
# Mean cluster size: 2.525 points
In their book, the authors of spatstat explain the mean cluster size as the number of offspring points dispersed around parent points, like plant seedlings. In my case no such process happens: conduits are xylem cells that develop from cambium cells at the outside of the stem's annual ring, and they do not disperse randomly.
I would like to estimate the mean cluster size and cluster scale for my conduit distribution data, and Scale and Mean cluster size look like what I want. However, the redwood data are different in nature from mine, so I am not sure what these quantities mean for my data. Furthermore, I am wondering which model suits my context: Neyman-Scott, Matérn cluster, Thomas, or something else?
any suggestion is appreciated.
Jingming
If you fit a parametric point process model such as a Thomas or Matérn cluster process, you are assuming the data are generated by a random process that produces a random number of clusters, each with a random number of points, and with the points placed randomly around each cluster center. The parameter kappa controls the expected number of clusters, mu controls the expected number of points in a cluster, and scale controls the spatial extent of each cluster. The type of process (Thomas, Matérn or others) determines the distribution of points within a cluster. My best suggestion is to do simulation experiments to understand these different types of processes and see whether they are appropriate for your needs.
For example, on average 10 clusters in the unit square, with on average 5 points in each and a short spatial extent (scale = 0.01), gives fairly well-defined, tight clusters:
library(spatstat)
set.seed(42)
sim1 <- rThomas(kappa = 10, mu = 5, scale = 0.01, nsim = 9)
plot(sim1, main = "")
With the same expected number of clusters and points per cluster but a larger spatial extent (scale = 0.05), the picture is much less clear and it is hard to see the clusters:
sim2 <- rThomas(kappa = 10, mu = 5, scale = 0.05, nsim = 9)
plot(sim2, main = "")
In conclusion: experiment with simulation, and remember to run many simulations of each experiment rather than just one, which can be very misleading.
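If it helps to see how the two standard cluster models compare on the same data, here is a minimal sketch using the built-in redwood pattern (replace it with a ppp object holding your conduit coordinates):
library(spatstat)
# fit two alternative Neyman-Scott cluster models to the same point pattern
fit_thomas <- kppm(redwood ~ 1, "Thomas")
fit_matern <- kppm(redwood ~ 1, "MatClust")
fit_thomas   # printed output reports kappa, scale and the mean cluster size
fit_matern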

Determine mode locations of the kernel density estimate of multimodal univariate data

I have a kernel density estimate, and when I plot it with a particular bandwidth I visually identify 7 local maxima. I would like to know how to plot the separate distributions around those maxima on the same plot.
Also, is it possible to find exactly where the maxima occur by running some code? I can make ball-park estimates from the plot, but is there an R function I can use to get the exact points? I would like to know the mean and variance of each of the 7 densities I have identified.
Specifically, I have the following:
plot(density(stamp, bw = 0.0013, kernel = "gaussian"))
Determining which modes of a kernel density estimate are real is a matter of which bandwidth you choose. This is a complicated question, and I don't advise settling on a single bandwidth, as even different optimal rules of thumb can give you different answers. In general, the number of modes of a KDE is smaller than the number of modes of the underlying density in the oversmoothed case and larger in the undersmoothed case. There are many papers on this topic that offer ways to assess whether a mode is real: Silverman's mode test for Gaussian kernels, Friedman and Fisher's PRIM algorithm, Marron's SiZer, and Minnotte and Scott's mode tree are good places to start.
A naive thing you can do, given a single choice of KDE bandwidth, is to check the run lengths.
In fact, with the bandwidth you have chosen, I find 9 modes. Just compute the sign of the differenced series and take the cumulative lengths of its runs to locate the turning points. Every other point will be a mode or an antimode, depending on which comes first (you can check the sign to determine which).
library(BSDA)
dstamp <- density(Stamp$thickness, bw=0.0013, kernel = "gaussian")
chng <- cumsum(rle(sign(diff(dstamp$y)))$lengths)
plot(dstamp)
abline(v = dstamp$x[chng[seq(1,length(chng),2)]])
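To keep only the maxima rather than marking every turning point, you can check the run values directly (a small extension of the same idea):
# a run of positive slopes ends at a local maximum (mode)
runs <- rle(sign(diff(dstamp$y)))
mode_idx <- cumsum(runs$lengths)[runs$values == 1]
dstamp$x[mode_idx]                          # x-locations of the modes
abline(v = dstamp$x[mode_idx], col = "red")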
Since I needed something that extracts only the strongest modes, I wrote a dead simple algorithm that lets you increase sensitivity by tweaking the number of density samples (to reduce local noise) and by setting a minimum density threshold, proportional to the maximum density (to reduce global noise).
find_posterior_modes <- function(x, n.samples = 100, filter = .1) {
  d <- density(x, n = n.samples)
  # keep grid points that are strict local maxima and exceed filter * max(density)
  x <- with(d, sapply(2:(n.samples - 1), function(i)
    if (y[i] > y[i - 1] & y[i] > y[i + 1] & y[i] > max(y) * filter) x[i]))
  unlist(x)
}
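For example (just an illustration, reusing the Stamp data from above), calling it on the thickness values returns the locations of the stronger peaks only:
find_posterior_modes(Stamp$thickness, n.samples = 512, filter = 0.1)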
I recently released the package ModEstM. It uses the same method as shayaa, with two features to suppress the less significant modes:
you can choose the bandwidth of the density estimation via the "adjust" parameter of the density function;
the modes are reported in decreasing order of the corresponding density.

Calculating p-value from pseudo-F in R

I'm working with a very large dataset with 132,019 observations of 18 variables. I've used the clusterSim package to calculate the pseudo-F statistic on clusters created using Kohonen SOMs. I'm trying to assess the various cluster sizes (e.g., 4, 6, 9 clusters) with p-values, but I'm getting weird results and I'm not statistically savvy enough to know what's going on.
I use the following code to get the pseudo-F.
library(clusterSim)
psF6 <- index.G1(yelpInfScale, cl = som.6$unit.classif)
psF6
[1] 48783.4
Then I use the following code to get the p-value. With lower.tail = TRUE I get 1, and with lower.tail = FALSE I get 0.
k6 <- 6
n <- 132019   # number of observations
pf(q = psF6, df1 = k6 - 1, df2 = n - k6, lower.tail = FALSE)
[1] 0
I guess I was expecting something other than a round number, so I'm confused about how to interpret the results. I get exactly the same results regardless of which cluster size I evaluate. I read somewhere about reversing df1 and df2 in the calculation, but that seems odd. Also, the reference text I'm using (Larose's "Data Mining and Predictive Analytics") uses this approach to evaluate k-means clusters, so I'm wondering whether the problem is that I'm using Kohonen clusters.
I'd check your data, but it's not impossible to get a p-value of either 0 or 1. In your case, assuming the data are right, it indicates that your data are heavily skewed and the clusters you created are an almost ideal fit. With lower.tail = FALSE, a p-value of zero says the sample is classified essentially perfectly, with virtually no chance of error; lower.tail = TRUE returning 1 is just the complementary statement. In other words, your observations are separated well enough to give 0 in the upper tail, while the cluster centres are close enough together to give a p-value of 1 in the lower tail. If I were you I'd try a "k-means with splitting" variant with different values of the distance parameter w to see how the data fit. If for some w the clusters fit with very low p-values, I don't think a model as complex as a SOM is really necessary.
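For reference, a minimal reproduction of the computation in the question, plugging in the reported pseudo-F with k = 6 and n = 132,019, shows how extreme the statistic is relative to the F distribution:
pf(q = 48783.4, df1 = 5, df2 = 132019 - 6, lower.tail = FALSE)   # effectively 0
pf(q = 48783.4, df1 = 5, df2 = 132019 - 6, lower.tail = TRUE)    # effectively 1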
