I have a weighted network in which nodes are highly interconnected (250 nodes where 90% of the nodes have degree = 249). The connections are weighted with a normalised index that goes from 0 to 1, where 1 means a strong connection and values close to 0 identify weak connections. The weight distribution of the network is right-skewed and follows a power law, with most edges having proximity close to 0.
I am using degree-preserving randomisation tests in order to compare the network-level characteristics to the distribution of the same characteristics extracted from the randomised graphs. I have implemented a degree-preserving randomisation of the following form:
#Obtain the degree distribution to be preserved in the randomised networks
gdegree <- degree(g)
#Generate 1000 random networks with the same degree distribution as g
g.random <- vector('list', 1000)
g.random.degree <- array(dim = c(1000,vcount(g)))
for(i in 1:1000){
  g.random[[i]] <- sample_degseq(gdegree)
  g.random.degree[i,] <- degree(g.random[[i]])
}
However, the randomised networks resulting from the code above are unweighted.
Is there a way to obtain weighted random networks using degree preserving randomisation? And secondly, is there a statistical test I can perform in order to select the "significant" connections (those that are observed consistently both in the empirical network and randomised networks)?
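One simple option for the first question, sketched here under stated assumptions, is to reassign the observed weights at random to the edges of each rewired graph. This keeps every node's degree and the overall weight distribution, but it does not preserve each node's strength (its sum of weights). The toy edge list and the names `edges`, `shuffled` and `node_degree` are made up for illustration.

```r
set.seed(1)

# Hypothetical toy edge list standing in for one degree-preserving randomised graph
edges <- data.frame(
  from   = c(1, 1, 2, 3),
  to     = c(2, 3, 3, 4),
  weight = c(0.90, 0.10, 0.05, 0.30)
)

# Degree of each node = number of edge endpoints it appears in
node_degree <- function(e) table(c(e$from, e$to))

# Permute the observed weights across the edges; the topology is untouched
shuffled <- edges
shuffled$weight <- sample(edges$weight)

node_degree(shuffled)   # same degrees as node_degree(edges)
sort(shuffled$weight)   # same set of weights as the original
```

With igraph objects the analogous step would be `E(g.random[[i]])$weight <- sample(E(g)$weight)`, which only works when the randomised graph has the same number of edges as `g`.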
I am currently running a k-means cluster analysis on some customer data at my company. I want to measure the performance of this clustering, but I don't know which packages can be used to measure it, and I am also unsure whether my clusters are grouped too closely together.
The data feeding my clustering is a simple RFM (recency, frequency, & monetary value). I also included average order value per transaction by customer. I used the elbow method to determine the optimal number of clusters to use. The data consists of 1400 customers and 4 metric values.
Attached is also an image of the cluster plot & R Code
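library(factoextra) #provides fviz_cluster(), used below
drop = c('CUST_Business_NM')
#Cleaning & Scaling the Data
new_cluster_data = na.omit(data)
#drop the name column from the NA-free data
new_cluster_data = new_cluster_data[, !(names(new_cluster_data)%in%drop)]
new_cluster_data = scale(new_cluster_data)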
#Elbow Method for Optimal Clusters
k.max <- 15
data <- new_cluster_data
wss <- sapply(1:k.max,
function(k){kmeans(data, k, nstart=50,iter.max = 15 )$tot.withinss})
#Plot out the Elbow
plot(1:k.max, wss,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total within-clusters sum of squares")
#Create the Cluster
kmeans_test = kmeans(new_cluster_data, centers = 8, nstart = 1000)
#Visualize the Cluster
fviz_cluster(kmeans_test, data = new_cluster_data, show.clust.cent = TRUE, geom = c("point", "text"))
You probably do not want to measure the performance of a cluster but the performance of the clustering algorithm, in this case kmeans.
First, you need to be clear about which distance measure you want to use. Distance-based clustering starts from a dissimilarity matrix, so the choice of distance measure is critical. Note that kmeans() itself always works with Euclidean distance; for other methods you can experiment with Euclidean, Manhattan, correlation-based or other distance measures, e.g., like this (get_dist() is from the factoextra package):
dis_pearson <- get_dist(yourdataset, method = "pearson")
This will give you the distance matrix; fviz_dist(dis_pearson), also from factoextra, visualizes it.
The output of kmeans has several bits of information. The most important with regard to your question are:
totss: the total sum of squares
withinss: vector of within-cluster sum of squares
tot.withinss: total within-cluster sum of squares
betweenss: the between-cluster sum of squares
Thus, the goal is to optimize these by playing with distances and other methods to cluster the data. kmeans() is part of base R's stats package, so you can extract these measures by running mycluster <- kmeans(yourdataframe, centers = 2) and then printing mycluster or accessing a component such as mycluster$tot.withinss.
Side comment: kmeans requires the number of clusters defined by the user (additional effort) and it is very sensitive to outliers.
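As a rough illustration of how these quantities can be turned into quality measures, the sketch below computes the share of the total sum of squares explained by the partition and the average silhouette width. It assumes the cluster package is installed and uses the built-in iris data as a stand-in for the customer data; the names `x`, `km`, `explained` and `avg_sil` are made up for the example.

```r
library(cluster)  # provides silhouette()

set.seed(42)
x  <- scale(iris[, 1:4])                  # stand-in for the scaled RFM data
km <- kmeans(x, centers = 3, nstart = 25)

# Share of the total sum of squares explained by the partition (closer to 1 = tighter clusters)
explained <- km$betweenss / km$totss

# Average silhouette width (ranges from -1 to 1; values near 0 suggest overlapping clusters)
sil     <- silhouette(km$cluster, dist(x))
avg_sil <- mean(sil[, "sil_width"])

explained
avg_sil
```

Comparing `avg_sil` across several values of `centers` is one way to cross-check the number of clusters suggested by the elbow plot.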
For some tree species, the conduits in wood cross sections clearly aggregate into clusters. It seems natural to fit a Cox process model from spatstat (R) to the conduit point data, and the results include an estimated "Mean cluster size". I am not sure what this index means: can I interpret it as the mean number of conduits per cluster across the whole conduit point pattern?
The code from a good example in the book is as follows:
fitM <- kppm(redwood ~ 1, "MatClust")
# Scale: 0.08654
# Mean cluster size: 2.525 points
In their book, the authors of spatstat explain the mean cluster size as the number of offspring points dispersed around each parent point, like plant seedlings. In my case no such process takes place: conduits are xylem cells that develop from cambium cells on the outside of the stem's annual ring; they do not disperse randomly.
I would like to estimate the mean cluster size and cluster scale for my conduit distribution data, and the Scale and Mean cluster size outputs seem to be what I want. However, the redwood data are different in nature from mine, so I am not sure what these quantities mean for my data. Furthermore, I am wondering which model suits my context: Neyman-Scott, MatClust, Thomas or another?
Any suggestions are appreciated.
If you fit a parametric point process model such as a Thomas or Matern cluster
process you are assuming the data is generated by a random process that
generates a random number of clusters with a random number of points in each
cluster. The location of the points around each cluster center is also random.
The parameter kappa controls the expected number of clusters, mu
controls the expected number of points in a cluster and scale controls the
extent of the cluster. The type of process (Thomas, Matern or others)
determines the distribution within the cluster. My best suggestion is to do
simulation experiments to understand these different types of processes and
see if they are appropriate for your needs.
For example on average 10 clusters in the unit square with on average 5
points in each and a short spatial extent (scale=0.01) of the cluster gives
you fairly well-defined tight clusters:
library(spatstat)
sim1 <- rThomas(kappa = 10, mu = 5, scale = 0.01, nsim = 9)
plot(sim1, main = "")
For example on average 10 clusters in the unit square with on average 5
points in each and a bigger spatial extent (scale=0.05) of the cluster gives
a less clear picture where it is hard to see the clusters:
sim2 <- rThomas(kappa = 10, mu = 5, scale = 0.05, nsim = 9)
plot(sim2, main = "")
In conclusion: Experiment with simulation and remember to do many simulations
of each experiment rather than just one, which can be very misleading.
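To make the roles of kappa, mu and scale concrete, here is a minimal base R sketch of a Thomas-type mechanism on the unit square. It is a simplification and not spatstat's rThomas: parents are only placed inside the window and offspring are not clipped to it. The function name `sim_thomas` and its return structure are invented for the example.

```r
set.seed(1)

# One simulated pattern: Poisson number of parents, Poisson number of
# offspring per parent, Gaussian displacement with standard deviation `scale`
sim_thomas <- function(kappa, mu, scale) {
  n_parents <- rpois(1, kappa)                       # number of clusters
  parents   <- cbind(runif(n_parents), runif(n_parents))
  n_off     <- rpois(n_parents, mu)                  # points per cluster
  idx       <- rep(seq_len(n_parents), n_off)        # parent index of each offspring
  offspring <- parents[idx, , drop = FALSE] +
    matrix(rnorm(2 * length(idx), sd = scale), ncol = 2)
  list(points = offspring, sizes = n_off)
}

# Many simulations, as recommended above, then pool the cluster sizes
sims  <- replicate(2000, sim_thomas(kappa = 10, mu = 5, scale = 0.01),
                   simplify = FALSE)
sizes <- unlist(lapply(sims, `[[`, "sizes"))

mean(sizes)   # close to mu, i.e. the "mean cluster size"
```

In this mechanism the mean cluster size is simply the average number of points generated per cluster centre, whatever physical process the centres are taken to represent.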
I would like to use R to test whether the degree distribution of a network behaves like a power-law with scale-free property. Nonetheless, I've read different people doing this in many different ways, and one confusing point is the input one should use in the model.
Barabasi, for example, recommends fitting a power-law to the 'complementary cumulative distribution' of degrees (see Advanced Topic 3.B of chapter 4, figure 4.22). However, I've seen people fit a power-law to the degrees of the graph (obtained with igraph::degree(g)), and I've also seen others fitting a power-law to a degree distribution, obtained via igraph::degree_distribution(g, cumulative = T)
As you can see in the reproducible example below, these options give very different results. Which one is correct? And how can I get the "complementary cumulative distribution of degrees" from a graph so I can fit a power law?
library(igraph)
# create a graph
g <- static.power.law.game(500, 1000, exponent.out= 2.2, exponent.in = 2.2, loops = FALSE, multiple = T)
# get input to fit power-law.
# 1) degrees of the nodes
d <- degree(g, v = V(g), mode ="all")
d <- d[ d > 0] # remove nodes with no connection
# OR ?
# 2) cumulative degree distribution
d <- degree_distribution(g, mode ="all", cumulative = T)
# Fit power law
fit <- fit_power_law(d, implementation = "R.mle")
Well, the problem is that you have 2 different statistics here.
The degree of a node shows how many connections it has to other nodes.
The degree distribution is the probability distribution of those degrees over the network.
For me it doesn't make much sense to apply igraph::fit_power_law to a degree distribution: the function expects the raw observations (the vector of degrees), whereas the output of degree_distribution() is already a vector of probabilities.
However, don't forget that the igraph::fit_power_law has more options than the implementation argument, which will result in different things, depending on what you're "feeding it".
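For reference, the complementary cumulative distribution can be computed directly from a vector of degrees in base R. The sketch below uses a made-up degree vector `d` (with igraph it would come from `degree(g)`) and assumes the convention P(K >= k); some texts use a strict inequality instead.

```r
# Hypothetical degree vector; with igraph this would come from degree(g)
d <- c(1, 1, 1, 2, 2, 3, 5, 8)

k    <- sort(unique(d))
ccdf <- sapply(k, function(x) mean(d >= x))   # fraction of nodes with degree >= k

data.frame(k = k, ccdf = ccdf)

# On log-log axes a power law shows up as a roughly straight line
plot(k, ccdf, log = "xy", type = "b",
     xlab = "degree k", ylab = "P(K >= k)")
```

The plot is useful for visual inspection, while the fitting function itself would still be given the raw degrees `d`.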
I'm exploring h2o via the R interface and I'm getting a weird weight matrix. My task is as simple as they get: given x,y compute x+y.
I have 214 rows with 3 columns. The first column (x) was drawn uniformly from (-1000, 1000) and the second one (y) from (-100, 100). I just want to add them, so I use a single hidden layer with a single neuron.
This is my code:
library(h2o)
localH2O = h2o.init(ip = "localhost", port = 54321, startH2O = TRUE)
train <- h2o.importFile(path = "/home/martin/projects/R NN Addition/addition.csv")
model <- h2o.deeplearning(1:2,3,train, hidden = c(1), epochs=200, export_weights_and_biases=T, nfolds=5)
and the result is
> print(h2o.weights(model,1))
x y
1 0.5586579 0.05518193
[1 row x 2 columns]
> print(h2o.weights(model,2))
1 1.802469
For some reason the weight value for y is 0.055 - 10 times lower than for x. So, in the end the neural net would compute x+y/10. However, h2o.predict actually returns the correct values (even on a test set).
I'm guessing there's a preprocessing step that's somehow scaling my data. Is there any way I can reproduce the actual weights produced by the model? I would like to be able to visualize some pretty simple neural networks.
Neural networks perform best if all the input features have mean 0 and standard deviation 1. If the features have very different standard deviations, neural networks perform very poorly. Because of that, H2O does this normalization for you. In other words, before even training your net it computes the mean and standard deviation of every feature and replaces the original values with (x - mean) / stddev. In your case the stddev of the second feature is 10x smaller than that of the first, so after normalization one unit of the second feature represents a 10x smaller change in the original data than one unit of the first, and the weights leading into the hidden neuron need to compensate for that. That's why the weight for the second feature is 10x smaller.
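The effect can be reproduced without H2O. The sketch below uses an ordinary linear regression as a stand-in for the single linear unit (an assumption, not what H2O runs internally): on standardized inputs the fitted weights differ by the ratio of the standard deviations, and dividing each weight by its feature's standard deviation recovers the weights on the original scale. All variable names are illustrative.

```r
set.seed(1)
n <- 214
x <- runif(n, -1000, 1000)
y <- runif(n, -100, 100)
target <- x + y

# Standardize the inputs as described above
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)

# Linear fit on the standardized inputs (stand-in for the network)
fit   <- lm(target ~ zx + zy)
w_std <- coef(fit)[c("zx", "zy")]

w_std[["zy"]] / w_std[["zx"]]      # about 0.1, the ratio sd(y) / sd(x)

# Undo the scaling to get weights that apply to the raw x and y
w_raw <- w_std / c(sd(x), sd(y))
w_raw                              # both equal to 1, i.e. x + y
```

The same division by the per-feature standard deviation is what one would apply to exported first-layer weights to express them on the raw input scale, provided the network was trained on standardized inputs.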
I have been using the mahal classifier function (dismo package in R) in several of my analyses, and I recently discovered that it seems to give wrong distance results for points that are identical to points used to train the classifier. For background, my understanding of Mahalanobis-based classifiers is that they use the Mahalanobis distance to describe the similarity of an unclassified point by measuring the point's distance from the center of mass of the training set (while accounting for differences in scale, covariance, etc.). The Mahalanobis distance score varies from -inf to 1, where 1 indicates no distance between the unclassified point and the centroid defined by the training set. However, I found that every point with predictor values identical to those of a training point gets a score of 1, as if the routine were working as a nearest-neighbor classifier. This is very troubling behavior because it has the potential to artificially inflate the confidence of my overall classification.
Has anyone encountered this behavior? Any ideas on how to fix/ avoid this behavior?
I have written a small script below that showcases the odd behavior clearly:
rm(list = ls()) #remove all past worksheet variables
library(dismo) #provides mahal(); also loads the raster package
logo <- stack(system.file("external/rlogo.grd", package="raster"))
#presence data (points that fall within the 'r' in the R logo)
pts <- matrix(c(48.243420, 48.243420, 47.985820, 52.880230, 49.531423, 46.182616,
54.168232, 69.624263, 83.792291, 85.337894, 74.261072, 83.792291, 95.126713,
84.565092, 66.275456, 41.803408, 25.832176, 3.936132, 18.876962, 17.331359,
7.048974, 13.648543, 26.093446, 28.544714, 39.104026, 44.572240, 51.171810,
56.262906, 46.269272, 38.161230, 30.618865, 21.945145, 34.390047, 59.656971,
69.839163, 73.233228, 63.239594, 45.892154, 43.252326, 28.356155), ncol=2)
# fit model
m <- mahal(logo, pts)
#using model, predict train data
training_vals=extract(logo, pts)
x <- predict(m, training_vals)
x #results show a perfect 1 prediction, which is highly unlikely
Now I try to make predictions for values that are the average of directly adjacent point pairs.
I do this because, given that
(1) each point of each pair used to train the model has a perfect suitability and
(2) at least some of these average points are likely to be as close to the Mahalanobis centroid as the original pairs,
(3) I would expect at least a few of the average points to have a perfect suitability as well.
#pick two adjacent points and fit model
adjacent_training_vals=extract(logo, adjacent_pts)
new_pts=rbind(pts, adjacent_pts)
plot(logo[[1]]) #plot predictor raster and response point pairs
#use model to predict mahalanobis score for new training data (point pairs)
m <- mahal(logo, new_pts)
new_training_vals=extract(logo, new_pts)
x <- predict(m, new_training_vals)
As expected from the odd behavior described, all training points have a distance score of 1. However, let's try to predict points that are an average of each pair:
x <- predict(m, mid_vals)
For me this is a further indication that the mahal routine will give a perfect score to any data point whose values equal those of any point used to train the model.
The example below is unnecessary, but it is just another way to prove the point:
Here I predict the same original training data with a nearly insignificant 'nudge' to the values of only one of the predictors and show that the resulting scores change quite significantly.
x <- predict(m, mod_training_vals)
x #predictions suddenly are far from perfect predictions
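For comparison, a centroid-based Mahalanobis distance of the kind described in the background paragraph can be computed with base R's mahalanobis(). The sketch below uses made-up random data (the names `train` and `d2` are illustrative) and says nothing about how dismo's mahal is implemented internally; it only shows that, under the centroid definition, training points do not generally sit at distance zero.

```r
set.seed(1)
n <- 20
p <- 3
train <- matrix(rnorm(n * p), ncol = p)   # hypothetical training predictors

# Squared Mahalanobis distance of every training point to the training centroid
d2 <- mahalanobis(train, center = colMeans(train), cov = cov(train))

range(d2)   # strictly positive: no training point lies exactly on the centroid
sum(d2)     # equals (n - 1) * p for any training set with invertible covariance
```

If a routine returns a perfect score for every training point, it is therefore measuring something other than the distance to a single centroid.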