Rewire in igraph for R: meaning of the niter parameter

For a network g (as below), what is the practical meaning of the niter parameter in the igraph::rewire function?
library(igraph)
library(dplyr)
g <- sample_smallworld(1, 10, 3, 0.05)
For example if I would run:
g1 <- g %>%
rewire(keeping_degseq(niter = 20))
g2 <- g %>%
rewire(keeping_degseq(niter = 100))
I do see differences between the two networks at the level of network properties (e.g. betweenness centrality), but I'm not sure which value is most appropriate if, for example, I want to do bootstrapping on my network. Part of the reason I don't know which value to choose is that I don't really understand what the niter parameter does.

This function randomly switches pairs of edges: it picks two edges, say (a, b) and (c, d), and rewires them to (a, d) and (c, b), which leaves every vertex's degree unchanged.
The switch is performed only if it wouldn't result in multi-edges.
niter specifies the number of trials. Some of the trials will not be successful.
Thus @CPak's statement in the comment that niter edges will be swapped is not correct; rather, niter swap attempts will be made.
This is explained in the documentation:
http://igraph.org/r/doc/keeping_degseq.html
http://igraph.org/c/doc/igraph-Generators.html#igraph_rewire
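As a quick illustration, a minimal sketch building on the graph from the question (not part of the original answer): the degree sequence is preserved for any niter, while a larger niter typically changes more edges.
library(igraph)
set.seed(1)
g  <- sample_smallworld(1, 10, 3, 0.05)
g1 <- rewire(g, keeping_degseq(niter = 20))
g2 <- rewire(g, keeping_degseq(niter = 100))
all(degree(g) == degree(g1))  # TRUE: the degree sequence never changes
all(degree(g) == degree(g2))  # TRUE
ecount(difference(g, g1))     # number of edges of g that are no longer in g1
ecount(difference(g, g2))     # usually larger, because more swaps were attempted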

Related

R - Compare performance of two types while controlling for interaction

I have been programming in R and have a dataset containing the results (success or not) of two machine learning algorithms which have been tried out with different numbers of parameters. An example is provided below:
type  success  parameter_amount
a1    0        15639
a1    0        18623
a1    1        19875
a2    1        12513
a2    1        10256
a2    0        12548
I now want to compare both algorithms to see which one has the best overall performance. But there is a catch. It is known that the higher the parameter_amount, the higher the chances of success. When checking the parameter amounts both algorithms were tested on, one can also notice that a1 was tested with higher parameter amounts than a2. This would make simply counting the number of successes of each algorithm unfair.
What would be a good approach to handle this scenario?
I will give you an answer, but without any guarantee that it is the right approach; for more precise advice you should give more information about the algorithms and the data. I would also suggest migrating this question to Cross Validated.
Indeed, your question is a statistical one. In statistics we look for parsimony: at a given level of performance we prefer a simpler model to a very complex one, because we worry about over-fitting: https://statisticsbyjim.com/regression/overfitting-regression-models/.
One way to do what you want is to compare performance with respect to model complexity, as in this toy example:
library(tidyverse)
library(ggplot2)
set.seed(123)
# number of estimations for each model
n <- 1000
performance_1 <- round(runif(n))
complexity_1 <- round(rnorm(n, mean = n, sd = 50))
performance_2 <- round(runif(n, min = 0, max = 0.6))
complexity_2 <- round(rnorm(n, mean = n, sd = 50))
df <- data.frame(performance = c(performance_1, performance_2),
                 complexity = c(complexity_1, complexity_2),
                 models = as.factor(c(rep(1, n), rep(2, n))))
temp <- df %>%
  group_by(complexity, models) %>%
  summarise(perf = sum(performance))
ggplot(temp, aes(x = complexity, y = perf, group = models, fill = models)) +
  geom_smooth() +
  theme_classic()
This only works if you have many data points. In your case, complexity is the number of parameters fitted. In this toy example, the first model looks better because it performs better at every level of complexity.
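If a single adjusted comparison is wanted rather than a plot, a minimal sketch of one common alternative, assuming the columns type, success and parameter_amount from the example above, is a logistic regression that includes the parameter amount as a covariate:
# Sketch: compare the two types while adjusting for parameter_amount.
# The six example rows from the question are used here only to show the model form.
dat <- data.frame(
  type = c("a1", "a1", "a1", "a2", "a2", "a2"),
  success = c(0, 0, 1, 1, 1, 0),
  parameter_amount = c(15639, 18623, 19875, 12513, 10256, 12548)
)
fit <- glm(success ~ type + parameter_amount, family = binomial, data = dat)
summary(fit)  # the type coefficient compares a2 to a1 at a fixed parameter_amount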

Unexpected behavior in spatstat inhomogeneous K-, F- and G-functions

I have a point pattern with about 84,000 points. Quadrat tests suggested inhomogeneous intensity, so I tried different kernel bandwidths and got very odd behavior in the inhomogeneous implementations of the K-, F- and G-functions. Here is an example of the inhomogeneous F-function plot. Clearly, the estimated F-function does not reach 1 within the distance range while the Poisson process just flatlines. The F-function should also be increasing, so the dips are odd. When manually specifying a longer range of r in the Finhom() function, the function still does not evaluate beyond the suggested range of 2000.
Unfortunately, I cannot share my data. However, I managed to reproduce some of the errors with an admittedly very simple example of a point pattern on the unit square:
library(spatstat) # version 1.57-1
# define point pattern
ex <- as.ppp(data.frame(x = c(.9, .25, .29, .7, .72, .8, .72, .85),
                        y = c(.1, .25, .29, .5, .5, .1, .45, .08)),
             W = owin(c(0,1), c(0,1)))
plot(ex)
# testing inhomogeneity
quadrat.test(ex, 3, 3, method = "M", nsim = 500) # p around 0.05
# set bandwidth
diggle <- bw.diggle(ex)
# suggested bandwidth of 0.028
# estimate inhomogeneous F-function
Fi <- Finhom(ex, sigma = diggle)
plot(Fi, main ="Finhom for ex pattern")
The plot is attached here. Similar to my real data, the plot stops evaluating at r = 0.5, flatlines and does not go up all the way to 1.
Interestingly, when supplying the intensity directly via the lambda argument in the Finhom() function, the behavior changes:
lambda_ex <- density(ex, sigma = diggle, at = "points")
Fi_lambda <- Finhom(ex, lambda = lambda_ex)
plot(Fi_lambda, main ="Finhom w/ lambda directly")
Here, the functions behave as expected.
My questions are:
why is there a difference between directly supplied intensity vs. intensity internally estimated in the Finhom() function?
what could be the reason for the odd behavior of the F-function here? A code issue or user error? (Sidenote, the G- and K-functions also return odd behavior, to keep this question short-ish, I've focused on the F-function)
Thank you!
As pointed out by Adrian Baddeley in the other answer, this is not a bug in Finhom per se. You would expect that
Fi <- Finhom(ex, sigma = diggle)
should be equivalent to
lambda_ex <- density(ex, sigma = diggle, at = "points")
Fi_lambda <- Finhom(ex, lambda = lambda_ex)
However, different values of the argument lmin are implied by these commands. In the first case lambda is estimated everywhere in the window and the minimum value is used. In the second case only the given values of lambda are used to find the minimum. That can of course be quite different. The importance of lmin is illustrated in the code below (note that the discrepancy between data and inhomogeneous Poisson is of the same type in all cases).
The other part about the estimate stopping at r=0.5 is not surprising since border correction is used and the window is the unit square. When r=0.5 the entire window is "shaved off", so there is no data left.
library(spatstat)
#> spatstat 1.56-1.031 (nickname: 'Psycho chicken')
X <- swedishpines
lam <- density(X, at = "points", sigma = 10)
lam_min <- min(lam)
plot(Finhom(X, lmin = lam_min), legend = FALSE, col = 1, main = "Finhom for different values of lmin")
s <- 2^(1:3)
for(i in seq_along(s)){
  plot(Finhom(X, lmin = lam_min/s[i]), col = i+1, add = TRUE)
}
s <- c(1,s)
legend("topleft", legend = paste0("min(lam)/", s), lty = 1, col = 1:length(s))
Created on 2018-11-24 by the reprex package (v0.2.1)
The "inhomogeneous" functions Kinhom, Ginhom, Finhom involve making adjustments for the spatially varying intensity of the point process. They only work if (a) the intensity has been accurately estimated, and (b) the point process satisfies certain technical assumptions which justify the adjustment calculation (see the references in the help files, or the relevant section of the spatstat book).
The plot of density(ex, sigma=bw.diggle) shows very high peaks and very low troughs in the estimated intensity, suggesting that the data are under-smoothed, so that (a) is not satisfied. The results obtained with bw.scott or bw.CvL are much better behaved. (Remember that bw.diggle is designed for clustered patterns.) For example, I get a reasonably nice plot with
plot(Finhom(ex, sigma=bw.CvL))
Yes, it does seem a bit disconcerting that the results are different when 'lambda' is given as a pixel image and as a numeric vector. This occurs, as Ege explains, because of the different rules for calculating the default value of the important argument lmin. It's not really a bug -- the original authors of the code for Ginhom and Finhom designed it this way; I will consult them for advice about whether we should change it. In the meantime, you can make the two calculations agree if you specify the value of lmin.
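For instance, a minimal sketch of that last suggestion, reusing the ex pattern and diggle bandwidth from the question, fixes lmin so that the two calls agree:
lam_im <- density(ex, sigma = diggle)                    # intensity estimated on the whole window
lmin_common <- min(lam_im)                               # minimum over the window
lambda_ex <- density(ex, sigma = diggle, at = "points")
Fi_a <- Finhom(ex, sigma = diggle, lmin = lmin_common)
Fi_b <- Finhom(ex, lambda = lambda_ex, lmin = lmin_common)
# with lmin fixed, the two estimates should coincide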

What should be an optimal value of K in K-means clustering for it to be implemented on any dataset?

As the question says, I'm making a visualization tool that should work with any dataset provided. What optimal K value should I select, and how?
You can use the Calinski criterion from the vegan package. Your phrasing of the question is a little debatable, so I hope this is what you are expecting; please comment if otherwise.
For example, you can do:
n = 100
g = 6
set.seed(g)
d <- data.frame(
  x = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))),
  y = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))))
require(vegan)
fit <- cascadeKM(scale(d, center = TRUE, scale = TRUE), 1, 10, iter = 1000)
plot(fit, sortg = TRUE, grpmts.plot = TRUE)
calinski.best <- as.numeric(which.max(fit$results[2,]))
cat("Calinski criterion optimal number of clusters:", calinski.best, "\n")
This would result in a value of 5, which means you can use 5 clusters; the criterion is built on the within- and between-cluster sums of squares of the k-means clustering. You can also write manual code based on that.
From the cascadeKM documentation:
criterion: The criterion that will be used to select the best
partition. The default value is "calinski", which refers to the
Calinski-Harabasz (1974) criterion. The simple structure index ("ssi")
is also available. Other indices are available in function clustIndex
(package cclust). In our experience, the two indices that work best
and are most likely to return their maximum value at or near the
optimal number of clusters are "calinski" and "ssi".
Manual code would look something like the following.
At the first iteration (k = 1) there is no between-cluster sum of squares, so the total sum of squares equals the within sum of squares:
# TSS = WSS: with a single cluster there is no between-cluster variance, so the
# total sum of squares (TSS) equals the within sum of squares (WSS)
wss <- (nrow(d)-1)*sum(apply(d, 2, var))
# From k = 2 onward TSS stays constant while the between sum of squares grows,
# so the within sum of squares decreases correspondingly
for (i in 2:15) wss[i] <- sum(kmeans(d, centers = i)$withinss)
# Plot WSS for k = 1..15 (the upper limit of 15 is arbitrary; pick what suits your data)
plot(1:15, wss, type = "b", xlab = "Number of Clusters",
     ylab = "Within groups sum of squares", col = "mediumseagreen", pch = 12)
The output of the above is an elbow plot: the point after which the line levels off is the one to pick as the optimal cluster size, in this case 5.
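For reference, the Calinski-Harabasz value that cascadeKM maximizes can also be computed by hand from the between- and within-cluster sums of squares. A minimal sketch, assuming the simulated data frame d from above and plain kmeans():
# CH(k) = (BSS / (k - 1)) / (WSS / (n - k))
ch_index <- function(data, k) {
  fit <- kmeans(data, centers = k, nstart = 25)
  n <- nrow(data)
  (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k))
}
ch <- sapply(2:10, function(k) ch_index(d, k))
which.max(ch) + 1  # the k with the largest index is the suggested number of clusters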

Louvain community detection in R using igraph - format of edges and vertices

I have a correlation matrix of scores that I would like to run community detection on using the Louvain method in igraph, in R. I converted the correlation matrix to a distance matrix using cor2dist, as below:
distancematrix <- cor2dist(correlationmatrix)
This gives a 400 x 400 matrix of distances from 0-2. I then made the list of edges (the distances) and vertices (each of the 400 individuals) using the below method from http://kateto.net/networks-r-igraph (section 3.1).
library(igraph)
test <- as.matrix(distancematrix)
mode(test) <- "numeric"
test2 <- graph.adjacency(test, mode = "undirected", weighted = TRUE, diag = TRUE)
E(test2)$weight
get.edgelist(test2)
From this I then wrote csv files of the 'from' and 'to' edge list, and corresponding weights:
edgeweights <-E(test2)$weight
write.csv(edgeweights, file = "edgeweights.csv")
fromtolist <- get.edgelist(test2)
write.csv(fromtolist, file = "fromtolist.csv")
From these two files I produced a .csv file called "nodes.csv" which simply had all the vertex IDs for the 400 individuals:
id
1
2
3
4
...
400
And a .csv file called "edges.csv", which detailed 'from' and 'to' between each node, and provided the weight (i.e. the distance measure) for each of these edges:
from  to   weight
1     2    0.99
1     3    1.20
1     4    1.48
...
399   400  0.70
I then tried to use this node and edge list to create an igraph object, and run louvain clustering in the following way:
nodes <- read.csv("nodes.csv", header = TRUE, as.is = TRUE)
edges <- read.csv("edges.csv", header = TRUE, as.is = TRUE)
clustergraph <- graph_from_data_frame(edges, directed = FALSE, vertices = nodes)
clusterlouvain <- cluster_louvain(clustergraph)
Unfortunately this did not do the louvain community detection correctly. I expected this to return around 2-4 different communities, which could be plotted similarly to here, but sizes(clusterlouvain) returned:
Community sizes
1
400
indicating that all individuals were sorted into the same community. The clustering also ran immediately (i.e. with almost no computation time), which also makes me think it was not working correctly.
My question is: Can anyone suggest why the cluster_louvain method did not work and identified just one community? I think I must be specifying the distance matrix or edges/nodes incorrectly, or in some other way not giving the correct input to the cluster_louvain method. I am relatively new to R so would be very grateful for any advice. I have successfully used other methods of community detection on the same distance matrix (i.e. k-means) which identified 2-3 communities, but would like to understand what I have done wrong here.
I'm aware there are multiple other queries about using igraph in R, but I have not found one which explicitly specifies the input format of the edges and nodes (from a correlation matrix) to get the louvain community detection working correctly.
Thank you for any advice! I can provide further information if helpful.
I believe that cluster_louvain did exactly what it should do with your data.
The problem is your graph. Your code included the line get.edgelist(test2), which must produce a lot of output. Instead, try this:
vcount(test2)
ecount(test2)
Since you say that your correlation matrix is 400x400, I expect that you will get that vcount gives 400 and ecount gives 79800 = 400 * 399 / 2. As you have constructed it, every node is directly connected to all other nodes. Of course there is only one big community.
I suspect that what you are trying to do is group variables that are correlated. If the correlation is near zero, the variables should be unconnected. What seems less clear is what to do with variables with correlation near -1. Do you want them to be connected or not? We can do it either way.
You do not provide any data, so I will illustrate with the Ionosphere data from the mlbench package. I will try to mimic your code pretty closely, but will change a few variable names. Also, for my purposes, it makes no sense to write the edges to a file and then read them back again, so I will just directly use the edges that are constructed.
First, assuming that you want variables with correlation near -1 to be connected.
library(igraph)
library(mlbench) # for Ionosphere data
library(psych) # for cor2dist
data(Ionosphere)
correlationmatrix = cor(Ionosphere[, which(sapply(Ionosphere, class) == 'numeric')])
distancematrix <- cor2dist(correlationmatrix)
DM1 <- as.matrix(distancematrix)
## Zero out connections where there is low (absolute) correlation
## Keeps connection for cor ~ -1
## You may wish to choose a different threshold
DM1[abs(correlationmatrix) < 0.33] = 0
G1 <- graph.adjacency(DM1, mode = "undirected", weighted = TRUE, diag = TRUE)
vcount(G1)
[1] 32
ecount(G1)
[1] 140
Not a fully connected graph! Now let's find the communities.
clusterlouvain <- cluster_louvain(G1)
plot(G1, vertex.color=rainbow(3, alpha=0.6)[clusterlouvain$membership])
If instead you do not want variables with negative correlation to be connected, just remove the absolute value above. This graph should be much less connected:
DM2 <- as.matrix(distancematrix)
## Zero out connections where there is low correlation
DM2[correlationmatrix < 0.33] = 0
G2 <- graph.adjacency(DM2, mode = "undirected", weighted = TRUE, diag = TRUE)
clusterlouvain <- cluster_louvain(G2)
plot(G2, vertex.color=rainbow(4, alpha=0.6)[clusterlouvain$membership])

alternative for wilcox.test in R

I'm trying a significance test using wilcox.test in R. I want to basically test if a value x is significantly within/outside a distribution d.
I'm doing the following:
d = c(90,99,60,80,80,90,90,54,65,100,90,90,90,90,90)
wilcox.test(60,d)
Wilcoxon rank sum test with continuity correction
data: 60 and d
W = 4.5, p-value = 0.5347
alternative hypothesis: true location shift is not equal to 0
Warning message:
In wilcox.test.default(60, d) : cannot compute exact p-value with ties
and basically the p-value is the same for a big range of numbers I test.
I've tried wilcox_test() from the coin package, but I can't get it to work when testing a single value against a distribution.
Is there an alternative to this test that does the same and knows how to deal with ties?
How worried are you about the non-exact results? I would guess that the approximation is reasonable for a data set this size. (I did manage to get coin::wilcox_test working, and the results are not hugely different ...)
d <- c(90,99,60,80,80,90,90,54,65,100,90,90,90,90,90)
pfun <- function(x) {
  suppressWarnings(w <- wilcox.test(x,d)$p.value)
  return(w)
}
testvec <- 30:120
p1 <- sapply(testvec,pfun)
library("coin")
pfun2 <- function(x) {
  dd <- data.frame(y=c(x,d), f=factor(c(1,rep(2,length(d)))))
  return(pvalue(wilcox_test(y~f, data=dd)))
}
p2 <- sapply(testvec,pfun2)
library("exactRankTests")
pfun3 <- function(x) {wilcox.exact(x,d)$p.value}
p3 <- sapply(testvec,pfun3)
Picture:
par(las=1,bty="l")
matplot(testvec, cbind(p1,p2,p3), type="s",
        xlab="value", ylab="p value of wilcoxon test", lty=1,
        ylim=c(0,1), col=c(1,2,4))
legend("topright", c("stats::wilcox.test","coin::wilcox_test",
                     "exactRankTests::wilcox.exact"),
       lty=1, col=c(1,2,4))
(exactRankTests added by request, but given that it's not maintained any more and recommends the coin package, I'm not sure how reliable it is. You're on your own for figuring out what the differences among these procedures are and which would be best to use ...)
The results make sense here -- the problem is just that your power is low. If your value is completely outside the range of the data, for n=15, that will be a probability of something like 2*(1/16)=0.125 [i.e. probability of your sample ending up as the first or the last element in a permutation], which is not quite the same as the minimum value here (wilcox.test: p=0.105, wilcox_test: p=0.08), but that might be an approximation issue, or I might have some detail wrong. Nevertheless, it's in the right ballpark.
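To see this numerically, a small sketch using the same d as above: testing a value far outside the range of d returns the smallest p-value this test can give for these data, roughly the 0.105 quoted above.
d <- c(90,99,60,80,80,90,90,54,65,100,90,90,90,90,90)
suppressWarnings(wilcox.test(1000, d)$p.value)  # value above every element of d; about 0.1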
You can do this:
wilcox.test(60,d, exact=FALSE)

Resources