Create a directed random graph specifying the alpha of a power-law degree distribution in R

I have a real directed graph for which I know the number of nodes and edges. The degree distribution approximates a power-law distribution. Now I want to create a random graph replicating the following features of my real graph:
Number of nodes
Number of edges
(Similar) power-law in-degree and out-degree distributions
Let's assume g is my real graph of 10000 nodes and 30000 edges
exp.out = 2.2
exp.in = 2.3
set.seed(123)
g <- static.power.law.game(10000, 30000, exp.out, exp.in, multiple=TRUE)
Yet I don't know exp.out and exp.in. Then I try to estimate the power-law exponents with the plfit function (downloaded here):
plfit(degree(g, mode="in")+1)
# $xmin
# [1] 5
#
# $alpha
# [1] 2.97
#
# $D
# [1] 0.01735342
plfit(degree(g, mode="out")+1)
# $xmin
# [1] 5
#
# $alpha
# [1] 2.83
#
# $D
# [1] 0.01589222
From which I then derive my distribution functions (respectively for indegree and outdegree):
p(x) ~ x^-2.97 for x >= 5
p(x) ~ x^-2.83 for x >= 5
According to the documentation of static.power.law.game
The game simply uses static.fitness.game with appropriately
constructed fitness vectors. In particular, the fitness of vertex i is
i^(-alpha), where alpha = 1/(gamma-1) and gamma is the exponent given
in the arguments
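To make the quoted passage concrete, here is a minimal sketch of the equivalent fitness-based construction it describes, assuming the same node and edge counts and the exponents 2.2 and 2.3 used above:
library(igraph)
n <- 10000; m <- 30000
gamma.out <- 2.2; gamma.in <- 2.3       # the exponents used for g above
fit.out <- (1:n)^(-1/(gamma.out - 1))   # fitness of vertex i is i^(-alpha), alpha = 1/(gamma - 1)
fit.in  <- (1:n)^(-1/(gamma.in - 1))
g.fit <- static.fitness.game(m, fit.out, fit.in, multiple = TRUE)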
As far as I understand it, to replicate my alphas I should pass as gammas respectively 1.3367 (2.97=1/(x-1)) and 1.35336 (2.83=1/(x-1)). Then
set.seed(321)
random.g <- static.power.law.game(10000, 30000, 1.35336, 1.3367, multiple=TRUE)
# Error in .Call("R_igraph_static_power_law_game", no.of.nodes, no.of.edges, :
# At games.c:3748 : out-degree exponent must be >= 2, Invalid value
Yet the fact that static.power.law.game only accepts degree exponents greater than or equal to 2 makes me think that I am probably missing something...

exp.out and exp.in should simply be the desired exponents of the out-degree and in-degree distributions; there is no need to apply any transformation to the exponents you obtained from plfit. However, note that it is unlikely that you will recover your "observed" exponents exactly, due to finite-size effects.
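In other words, a minimal sketch (assuming the plfit estimates printed above) is to pass the fitted alphas straight in and then re-check them:
set.seed(321)
random.g <- static.power.law.game(10000, 30000,
                                  exponent.out = 2.83,  # alpha from plfit on the out-degrees
                                  exponent.in  = 2.97,  # alpha from plfit on the in-degrees
                                  multiple = TRUE)
plfit(degree(random.g, mode = "in") + 1)$alpha   # expected to be near 2.97, up to finite-size effects
plfit(degree(random.g, mode = "out") + 1)$alpha  # expected to be near 2.83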

Why is Adam optimization unable to converge in linear regression?

I am studying the Adam optimizer. This is a toy problem. In R, I generate some artificial data:
Y = c0 + c1 * x1 + c2 * x2 + noise
In the above equation, x1, x2 and noise are normal random numbers generated in R, and theta = [c0, c1, c2] is the parameter vector I try to estimate with the Adam optimizer. For this simple regression problem, I can determine theta analytically; this is the k in my R code below.
Regarding the Adam algorithm, I use the formulae from this site:
Overview: Adam
I vary the step size eta in this parametric study. The final theta from the Adam algorithm is not the same as the analytical solution k in my R code below.
I have checked my code many times and run it line by line, but I cannot understand why the Adam algorithm does not converge.
Added:
I changed the algorithm to AMSGrad. It performs better than Adam in this case; however, AMSGrad still does not converge.
rm(list = ls())
n=500
x1=rnorm(n,mean=6,sd=1.6)
x2=rnorm(n,mean=4,sd=2.5)
X=cbind(x1,x2)
A=as.matrix(cbind(intercept=rep(1,n),x1,x2))
Y=-20+51*x1-15*x2+rnorm(n,mean=0,sd=2);
k=solve(t(A)%*%A,t(A)%*%Y) # k is the parameters determined by analytical method
MSE=sum((A%*%k-Y)^2)/(n);
iterations=4000 # total number of steps
epsilon = 0.0001 # set precision
eta=0.04 # step size
beta1=0.9
beta2=0.999
t1=integer(iterations)
t2=matrix(0,iterations,3)
t3=integer(iterations)
epsilon1=1E-8 # small number defined for numerical computation
X=as.matrix(X)# convert data table X into a matrix
N=dim(X)[1] # total number of observations
X=as.matrix(cbind(intercept=rep(1,N),X)) # add a column of ones to represent the intercept
np=dim(X)[2] # number of parameters to be determined
theta=matrix(rnorm(n=np,mean=0,sd=2),1,np) # Initialize theta:1 x np matrix
m_i=matrix(0,1,np) # initialization, zero vector
v_i=matrix(0,1,np) # initialization, zero vector
for(i in 1:iterations){
error=theta%*%t(X)-t(Y) # error = (theta * x' -Y'). Error is a 1xN row vector;
grad=1/N*error%*%X # Gradient grad is 1 x np vector
m_i=beta1*m_i+(1-beta1)*grad # moving average of gradient, 1 x np vector
v_i=beta2*v_i+(1-beta2)*grad^2 # moving average of squared gradients, 1 x np vector
# corrected moving averages
m_corrected=m_i/(1-beta1^i)
v_corrected=v_i/(1-beta2^i)
d_theta=eta/(sqrt(v_corrected)+epsilon1)*m_corrected
theta=theta-d_theta
L=sqrt(sum((d_theta)^2)) # calculating the L2 norm
t1[i]=L # record the L2 norm in each step
if ((is.infinite(L))||(is.nan(L))) {
print("Learning rate is too large. Lowering the rate may help.")
break
}
else if (L<=epsilon) {
print("Algorithm convergence is reached.")
break # checking whether convergence is obtained or not
}
# if (i==1){
# browser()
# }
}
plot(t1,type="l",ylab="norm",lwd=3,col=rgb(0,0,1))
k
theta
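As a sanity check on the analytical part, a minimal sketch (assuming the Y, x1 and x2 generated above) that compares k with R's built-in least-squares fit:
fit <- lm(Y ~ x1 + x2)                           # ordinary least squares on the same data
cbind(analytical = as.vector(k), lm = coef(fit)) # the two columns should agree closely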

Find number of clusters using distance matrix with hierarchical clustering

How do I determine the optimal number of clusters when using hierarchical clustering? I only have a distance matrix, since I am measuring pairwise distances (Levenshtein distances) between strings; how do I find the optimal number of clusters in this case? The other posts I referred to all use k-means or hierarchical clustering on numeric data, not on string data as shown below. Any suggestions on how to use R to find the number of clusters?
set.seed(1)
rstr <- function(n,k){ # vector of n random char(k) strings
sapply(1:n,function(i) {do.call(paste0,as.list(sample(letters,k,replace=T)))})
}
str<- c(paste0("aa",rstr(10,3)),paste0("bb",rstr(10,3)),paste0("cc",rstr(10,3)))
# Levenshtein Distance
d <- adist(str)
rownames(d) <- str
hc <- hclust(as.dist(d))
Several statistics can be used.
Look, for example, at the WeightedCluster package, which can compute and plot a series of such statistics.
To illustrate, you get the optimal number of groups for each available statistic as follows:
library("WeightedCluster")
hcRange <- as.clustrange(hc, diss=as.dist(d), ncluster=6)
summary(hcRange)
## 1. N groups 1. stat
## PBC 3 0.8799136
## HG 3 1.0000000
## HGSD 3 0.9987651
## ASW 3 0.4136550
## ASWw 3 0.4722895
## CH 3 8.3605263
## R2 6 0.4734561
## CHsq 3 20.6538462
## R2sq 6 0.6735039
## HC 3 0.0000000
You can also plot the statistics (here we show the average silhouette width, ASWw; Huber's Gamma, HG; and the point biserial correlation, PBC) for all the computed solutions:
plot(hcRange, stat = c("ASWw", "HG", "PBC"), lwd = 2)
The best solution seems to be the three-group solution.
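If you retain the three-group solution, a minimal follow-up sketch (reusing the hc and str objects defined above) to extract and inspect the partition with cutree:
groups <- cutree(hc, k = 3)        # cut the dendrogram into three clusters
table(groups, substr(str, 1, 2))   # cross-tabulate clusters against the known "aa"/"bb"/"cc" prefixes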

How to compute the coefficients of a polynomial given a vector containing the roots in R

I want to compute the coefficients of a polynomial based on a vector containing the roots. I first defined a vector of coefficients:
pol <- c(0,1,2,3,4)
and computed the roots
roots <- polyroot(pol)
to have a test result.
Then I tried the following:
result <- 1
for (n in 1:(length(roots))){
result <- c(result, 0) + c(0,-roots[n]*result)
}
But my result is the following:
result
[1] 1.00+0i 0.75+0i 0.50+0i 0.25+0i 0.00+0i
What am I missing here?
Notice that
identical(polyroot(pol), polyroot(pol / 4))
# [1] TRUE
That is, by going from a polynomial to its roots you lose information about the coefficient of the highest degree term (in this case, 4). For instance, 2x^2-x=2x(x-1/2), but also x^2-x/2=x(x-1/2), so that the roots are the same and we only normalized the first polynomial with respect to the quadratic term. So,
Re(result) * 4
# [1] 4 3 2 1 0
gives the desired result, but it also requires knowing tail(pol, 1) (the leading coefficient).
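For reuse, the same idea can be wrapped in a small helper (poly_from_roots is a hypothetical name introduced here for illustration) that rebuilds the coefficients from the roots and a known leading coefficient:
poly_from_roots <- function(roots, leading = 1) {
  result <- 1
  for (r in roots) {
    result <- c(result, 0) + c(0, -r * result)  # multiply the current polynomial by (x - r)
  }
  leading * result  # rescale by the known leading coefficient; highest-degree term first
}
Re(poly_from_roots(polyroot(c(0, 1, 2, 3, 4)), leading = 4))
# [1] 4 3 2 1 0 (as above)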

generating random x and y coordinates with a minimum distance

Is there a way in R to generate random coordinates with a minimum distance between them?
E.g. what I'd like to avoid
x <- c(0,3.9,4.1,8)
y <- c(1,4.1,3.9,7)
plot(x~y)
This is a classical problem in stochastic geometry. Completely random points in space, where the numbers of points falling in disjoint regions are independent of each other, correspond to a homogeneous Poisson point process (in this case in R^2, but it could be in almost any space).
An important feature is that the total number of points has to be random before you can have independence of the counts of points in disjoint regions.
For the Poisson process, points can be arbitrarily close together. If you define a process by sampling the Poisson process until you don't have any points that are too close together, you have the so-called Gibbs hard-core process. This has been studied a lot in the literature and there are different ways to simulate it.
The R package spatstat has functions to do this. rHardcore is a perfect sampler, but if you want a high intensity of points and a big hard-core distance it may not terminate in finite time. The distribution can also be obtained as the limit of a Markov chain, and rmh.default lets you run a Markov chain with a given Gibbs model as its invariant distribution. This finishes in finite time but only gives a realisation of an approximate distribution.
In rmh.default you can also simulate conditional on a fixed number of points. Note that when you sample in a finite box there is of course an upper limit to how many points you can fit with a given hard core radius, and the closer you are to this limit the more problematic it becomes to sample correctly from the distribution.
Example:
library(spatstat)
beta <- 100; R = 0.1
win <- square(1) # Unit square for simulation
X1 <- rHardcore(beta, R, W = win) # Exact sampling -- beware it may run forever for some par.!
plot(X1, main = paste("Exact sim. of hardcore model; beta =", beta, "and R =", R))
minnndist(X1) # Observed min. nearest neighbour dist.
#> [1] 0.102402
Approximate simulation
model <- rmhmodel(cif="hardcore", par = list(beta=beta, hc=R), w = win)
X2 <- rmh(model)
#> Checking arguments..determining simulation windows...Starting simulation.
#> Initial state...Ready to simulate. Generating proposal points...Running Metropolis-Hastings.
plot(X2, main = paste("Approx. sim. of hardcore model; beta =", beta, "and R =", R))
minnndist(X2) # Observed min. nearest neighbour dist.
#> [1] 0.1005433
Approximate simulation conditional on number of points
X3 <- rmh(model, control = rmhcontrol(p=1), start = list(n.start = 42))
#> Checking arguments..determining simulation windows...Starting simulation.
#> Initial state...Ready to simulate. Generating proposal points...Running Metropolis-Hastings.
plot(X3, main = paste("Approx. sim. given n =", 42))
minnndist(X3) # Observed min. nearest neighbour dist.
#> [1] 0.1018068
OK, how about this? You just generate random number pairs without restriction and then remove the ones which are too close to each other. This could be a great start for that:
minimumDistancePairs <- function(x, y, minDistance){
i <- 1
repeat{
distance <- sqrt((x-x[i])^2 + (y-y[i])^2) < minDistance # pythagorean theorem
distance[i] <- FALSE # distance to oneself is always zero
if(any(distance)) { # if too close to any other point
x <- x[-i] # remove element from x
y <- y[-i] # and remove element from y
} else { # otherwise...
i = i + 1 # repeat the procedure with the next element
}
if (i > length(x)) break
}
data.frame(x,y)
}
minimumDistancePairs(
c(0,3.9,4.1,8)
, c(1,4.1,3.9,7)
, 1
)
will lead to
x y
1 0.0 1.0
2 4.1 3.9
3 8.0 7.0
Be aware, though, that these are not random numbers anymore (however you solve the problem).
You can use rejection sampling: https://en.wikipedia.org/wiki/Rejection_sampling
The principle is simple: you resample until your data satisfy the condition.
> set.seed(1)
>
> x <- rnorm(2)
> y <- rnorm(2)
> (x[1]-x[2])^2+(y[1]-y[2])^2
[1] 6.565578
> while((x[1]-x[2])^2+(y[1]-y[2])^2 > 1) {
+ x <- rnorm(2)
+ y <- rnorm(2)
+ }
> (x[1]-x[2])^2+(y[1]-y[2])^2
[1] 0.9733252
>
The following is a naive hit-and-miss approach which for some choices of parameters (which were left unspecified in the question) works well. If performance becomes an issue, you could experiment with the package gpuR which has a GPU-accelerated distance matrix calculation.
rand.separated <- function(n,x0,x1,y0,y1,d,trials = 1000){
for(i in 1:trials){
nums <- cbind(runif(n,x0,x1),runif(n,y0,y1))
if(min(dist(nums)) >= d) return(nums)
}
return(NA) #no luck
}
This repeatedly draws samples of size n in [x0,x1] x [y0,y1] and throws a sample away if it doesn't satisfy the minimum-distance requirement. As a safety measure, trials guards against an infinite loop. If solutions are hard to find or n is large, you might need to adjust trials.
For example:
> set.seed(2018)
> nums <- rand.separated(25,0,10,0,10,0.2)
> plot(nums)
runs almost instantly and produces a scatter plot of the 25 points, with no pair closer than 0.2.
I'm not sure what you are asking. If you want random coordinates, here:
c(
runif(1,max=y[1],min=x[1]),
runif(1,max=y[2],min=x[2]),
runif(1,min=y[3],max=x[3]),
runif(1,min=y[4],max=x[4])
)

how to solve multi dimension integral equations with variable on upper bounds

I would like to solve an equation like the one below, where X is the only unknown variable and the function f() is a multivariate Student t density.
More precisely, I have a k-dimensional integral of a Student t density, which gives a probability as a result, and I know that this probability equals q. The lower bound of every integral is -Inf and the upper bounds of the last k-1 dimensions are given; the only unknown is the upper bound of the first integral. So there should be a solution: one variable and one equation. I tried to solve it in R. I used Dynamic Conditional Correlation to obtain a correlation matrix with which to specify my t distribution, plugged this correlation matrix into the multivariate t density dmvt, and used the adaptIntegrate function from the cubature package to build a function to pass to uniroot to solve for the upper bound of the first integral. But I have some difficulties achieving what I want. (I hope my question is clear.) I provided my code (see below); somebody told me there is a problem with it, but I cannot find what the issue is. Many thanks in advance for your help.
I know how to deal with a one-dimensional integral, but I don't know how a multi-dimensional integral equation can be solved in R (e.g. for the two-dimensional case):
\int_{-\infty}^{X} \int_{-\infty}^{Y_{1}} \cdots \int_{-\infty}^{Y_{k}} f(x, y_{1}, \cdots, y_{k}) \, dx \, dy_{1} \cdots dy_{k} = q
This code fails:
require(cubature)
require(mvtnorm)
corr <- matrix(c(1,0.8,0.8,1),2,2)
f <- function(x){ dmvt(x,sigma=corr,df=3) }
g <- function(y) adaptIntegrate(f,
lowerLimit = c( -Inf, -Inf),
upperLimit = c(y, -0.1023071))$integral-0.0001
uniroot( g, c(-2, 2))
Since mvtnorm includes a pmvt function that computes the CDF of the multivariate t distribution, you don't need to do the integral by brute force. (mvtnorm also includes a quantile function qmvt, but only for "equicoordinate" values.)
So:
library(mvtnorm)
g <- function(y1_upr,y2_upr=-0.123071,target=1e-4,df=3) {
pmvt(upper=c(y1_upr,y2_upr),df=df)-target
}
uniroot(g,c(-10000,0))
## $root
## [1] -17.55139
##
## $f.root
## [1] -1.699876e-11
## attr(,"error")
## [1] 1e-15
## attr(,"msg")
## [1] "Normal Completion"
##
## $iter
## [1] 18
##
## $estim.prec
## [1] 6.103516e-05
##
Double-check:
pmvt(upper=c(-17.55139,-0.123071),df=3)
## [1] 1e-04
## attr(,"error")
## [1] 1e-15
## attr(,"msg")
## [1] "Normal Completion"
