Setting up a statnet model in R

I would like to simulate exponential family random graphs, and I just started learning to use the statnet and ergm R packages. From the tutorial I found online, I am able to learn an ERGM model from an example dataset:
# install.packages('statnet')
# install.packages('ergm')
# install.packages('coda')
library(statnet)
set.seed(123)
data(package='ergm') # tells us the datasets in our packages
data(florentine) # loads flomarriage and flobusiness data
# Triad model
flomodel <- ergm(flomarriage ~ edges + triangle)
summary(flomodel)
Currently, I would like to use the simulate command to simulate networks with a pre-specified number of nodes from a pre-specified formula (that is not learned from any particular dataset), for example, P(y) = 1/Z exp(a * num_edges + b * num_triangles), where a and b are user-specified coefficients.
How should I go about writing such a model in statnet?

You can simulate from a given formula with simulate (or simulate.formula):
simulate(flomarriage ~ edges + triangles, coef = c(3,1))
To fix a simulation to have the same number of edges as a given graph (flomarriage in this case):
simulate(flomarriage ~ edges + triangles, coef = c(3,1), constraints = ~edges)
Not every constraint you might want to apply is available, since each requires a specific MCMC sampler; for a list of what is available, see ?ergm.constraints
To fix the simulation to have an arbitrary number of nodes and edges (not based on observed data), a workaround is to create such a network first. For example, to simulate over networks with 17 nodes and 16 edges:
test.mat = matrix(0, 17, 17)
test.mat[1,] = 1 # adds 16 edges (the diagonal entry is ignored, since loops are not allowed)
test.net = as.network(test.mat, directed = F)
test.sim = simulate(test.net ~ triangles, coef = 1, constraints = ~edges)
summary(test.sim ~ edges + triangles)
P.S. I don't recommend using the triangles term in ERGM models; the geometrically weighted terms (gwesp, gwdsp) are more stable substitutes.
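For example, a minimal sketch of a gwesp-based simulation, with purely illustrative (not fitted) coefficients:
library(statnet)
set.seed(123)
data(florentine)
# gwesp(decay, fixed = TRUE) is the geometrically weighted edgewise shared partner term;
# the coefficients below are made up for illustration only.
sim.gw <- simulate(flomarriage ~ edges + gwesp(0.5, fixed = TRUE),
                   coef = c(-2, 0.8))
summary(sim.gw ~ edges + triangles)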

Related

Clustering Time Series in R - is k-means accurate?

My data set is composed of measurements of the same index over 14 years (columns) for 105 countries (rows). I want to cluster countries based on their index trend over time.
I am trying hierarchical clustering (hclust) and k-medoids (pam), exploiting a DTW distance matrix (dtw package).
I also tried k-means, using the DTW distance matrix as the first argument of the kmeans function. The algorithm runs, but I'm not sure about its accuracy, since k-means uses Euclidean distance and computes centroids as means.
I am also thinking about using the data directly, but I can't see how the result would be accurate, since the algorithm would treat different measurements of the same variable over time as different variables when computing the centroids at each iteration, and would use Euclidean distance to assign observations to clusters. It doesn't seem to me that this process could cluster time series as well as hierarchical and k-medoids clustering.
Is the k-means algorithm a good choice for clustering time series, or is it better to use algorithms that exploit a distance concept such as DTW (but are slower)? Is there an R function that allows the k-means algorithm to be used with a distance matrix, or a specific package for clustering time series data?
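For concreteness, here is roughly what I am doing now (toy data standing in for my real 105 x 14 matrix):
library(dtw)      # registers the "DTW" distance with the proxy package
library(cluster)  # pam()
set.seed(1)
series <- matrix(rnorm(105 * 14), nrow = 105, ncol = 14)  # rows = countries, columns = years
d <- proxy::dist(series, method = "DTW")  # pairwise DTW distances between rows
hc <- hclust(d, method = "average")       # hierarchical clustering on the DTW matrix
groups_hc <- cutree(hc, k = 4)
groups_pam <- pam(d, k = 4)$clustering    # k-medoids (PAM) on the same matrix
table(groups_hc, groups_pam)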
KMeans will do exactly what you tell it to do. Unfortunately, trying to feed a time series dataset into a KMeans algo will produce meaningless results. The KMeans algo, and most general clustering methods, are built around Euclidean distance, which does not seem to be a good measure for time series data. Quite simply, k-means often doesn't work when clusters are not round shaped, because it uses a distance function and measures distance from the cluster center. Check out the GMM algo as an alternative. It sounds like you are going with R for this experiment. If so, check out the sample code below.
Here is a k-means clustering result [image omitted], and here is a GMM clustering result [image omitted]. Which one looks more like a time series plot to you?
I Googled around for a good sample of R code to demonstrate how GMM clustering works. Unfortunately, I couldn't find anything decent. Personally, I use Python much more than I use R. If you are open to a Python solution, check out the sample code below.
import numpy as np
import itertools
from scipy import linalg
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn import mixture

# Number of samples per component
n_samples = 500

# Generate random sample, two components
np.random.seed(0)
C = np.array([[0., -0.1], [1.7, .4]])
X = np.r_[np.dot(np.random.randn(n_samples, 2), C),
          .7 * np.random.randn(n_samples, 2) + np.array([-6, 3])]

lowest_bic = np.infty
bic = []
n_components_range = range(1, 7)
cv_types = ['spherical', 'tied', 'diag', 'full']
for cv_type in cv_types:
    for n_components in n_components_range:
        # Fit a Gaussian mixture with EM
        gmm = mixture.GaussianMixture(n_components=n_components,
                                      covariance_type=cv_type)
        gmm.fit(X)
        bic.append(gmm.bic(X))
        if bic[-1] < lowest_bic:
            lowest_bic = bic[-1]
            best_gmm = gmm

bic = np.array(bic)
color_iter = itertools.cycle(['navy', 'turquoise', 'cornflowerblue',
                              'darkorange'])
clf = best_gmm
bars = []

# Plot the BIC scores
plt.figure(figsize=(8, 6))
spl = plt.subplot(2, 1, 1)
for i, (cv_type, color) in enumerate(zip(cv_types, color_iter)):
    xpos = np.array(n_components_range) + .2 * (i - 2)
    bars.append(plt.bar(xpos, bic[i * len(n_components_range):
                                  (i + 1) * len(n_components_range)],
                        width=.2, color=color))
plt.xticks(n_components_range)
plt.ylim([bic.min() * 1.01 - .01 * bic.max(), bic.max()])
plt.title('BIC score per model')
xpos = np.mod(bic.argmin(), len(n_components_range)) + .65 + \
    .2 * np.floor(bic.argmin() / len(n_components_range))
plt.text(xpos, bic.min() * 0.97 + .03 * bic.max(), '*', fontsize=14)
spl.set_xlabel('Number of components')
spl.legend([b[0] for b in bars], cv_types)

# Plot the winner
splot = plt.subplot(2, 1, 2)
Y_ = clf.predict(X)
for i, (mean, cov, color) in enumerate(zip(clf.means_, clf.covariances_,
                                           color_iter)):
    v, w = linalg.eigh(cov)
    if not np.any(Y_ == i):
        continue
    plt.scatter(X[Y_ == i, 0], X[Y_ == i, 1], .8, color=color)

    # Plot an ellipse to show the Gaussian component
    angle = np.arctan2(w[0][1], w[0][0])
    angle = 180. * angle / np.pi  # convert to degrees
    v = 2. * np.sqrt(2.) * np.sqrt(v)
    ell = mpl.patches.Ellipse(mean, v[0], v[1], angle=180. + angle, color=color)
    ell.set_clip_box(splot.bbox)
    ell.set_alpha(.5)
    splot.add_artist(ell)

plt.xticks(())
plt.yticks(())
plt.title('Selected GMM: full model, 2 components')
plt.subplots_adjust(hspace=.35, bottom=.02)
plt.show()
Finally, here's an example of how to visualize clusters in R using plotGMM. The code to reproduce follows:
library(quantmod)
library(ggplot2)   # provides the fortify() generic (the zoo/xts method comes with quantmod's dependencies)
library(mixtools)
library(plotGMM)
SCHB <- fortify(getSymbols('SCHB', auto.assign = FALSE))
set.seed(730) # for reproducibility
mixmdl <- normalmixEM(Cl(SCHB), k = 5)  # fit a 5-component Gaussian mixture to the closing prices
plot_GMM(mixmdl, k = 5)                 # plot the 5 clusters
I hope that helps. Oh, and for plotting time series with ggplot2, you should avail yourself of ggplot2's fortify function, as in the sketch below.
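A minimal sketch of that (the SCHB.* column names assume quantmod's default naming):
library(quantmod)   # getSymbols() returns an xts object
library(ggplot2)
library(zoo)        # supplies the fortify() method for zoo/xts objects
SCHB_xts <- getSymbols('SCHB', auto.assign = FALSE)
SCHB_df <- fortify(SCHB_xts)   # data.frame with an Index column plus the price/volume columns
ggplot(SCHB_df, aes(x = Index, y = SCHB.Close)) + geom_line()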

Weighted Kmeans R

I want to do a Kmeans clustering on a dataset (namely, Sample_Data) with three variables (columns) such as below:
A B C
1 12 10 1
2 8 11 2
3 14 10 1
. . . .
. . . .
. . . .
In a typical case, after scaling the columns and determining the number of clusters, I would use this function in R:
Sample_Data <- scale(Sample_Data)
output_kmeans <- kmeans(Sample_Data, centers = 5, nstart = 50)
But what if there is a preference among the variables? I mean, suppose variable (column) A is more important than the other two variables?
How can I insert their weights in the model?
Thank you all
You have to use a weighted k-means clustering, like the one provided in the flexclust package:
https://cran.r-project.org/web/packages/flexclust/flexclust.pdf
The function
cclust(x, k, dist = "euclidean", method = "kmeans",
       weights = NULL, control = NULL, group = NULL, simple = FALSE,
       save.data = FALSE)
Perform k-means clustering, hard competitive learning or neural gas on a data matrix.
weights: An optional vector of weights to be used in the fitting process. Works only in combination with hard competitive learning.
A toy example using iris data:
library(flexclust)
data(iris)
cl <- cclust(iris[, -5], k = 3, save.data = TRUE, weights = c(1, 0.5, 1, 0.1), method = "hardcl")
cl
kcca object of family ‘kmeans’
call:
cclust(x = iris[, -5], k = 3, method = "hardcl", weights = c(1, 0.5, 1, 0.1), save.data = TRUE)
cluster sizes:
1 2 3
50 59 41
As you can see from the output of cclust, even when using competitive learning the family is still kmeans.
The difference lies in how cluster assignment happens during the training phase:
If method is "kmeans", the classic kmeans algorithm as given by
MacQueen (1967) is used, which works by repeatedly moving all cluster
centers to the mean of their respective Voronoi sets. If "hardcl",
on-line updates are used (AKA hard competitive learning), which work
by randomly drawing an observation from x and moving the closest
center towards that point (e.g., Ripley 1996).
The weights parameter is just a sequence of numbers; in general I use numbers between 0.01 (minimum weight) and 1 (maximum weight).
I had the same problem and the answer here is not satisfying for me.
What we both wanted was an observation-weighted k-means clustering in R. A good readable example for our question is this link: https://towardsdatascience.com/clustering-the-us-population-observation-weighted-k-means-f4d58b370002
However, the solution of using the flexclust package is not satisfying, simply because the algorithm used is not the "standard" k-means algorithm but the "hard competitive learning" algorithm. The differences are well described above and in the package documentation.
I looked through many sites and did not find any solution/package in R for performing a "standard" k-means algorithm with weighted observations. I was also wondering why the flexclust package explicitly does not support weights with the standard k-means algorithm. If anyone has an explanation for this, please feel free to share!
So basically you have two options: First, rewrite the flexclust-algorithm to enable weights within the standard approach. Or second, you can estimate weighted cluster centroids as starting centroids and perform a standard k-means algorithm with only one iteration, then compute new weighted cluster centroids and perform a k-means with one iteration and so on until you reach convergence.
I used the second alternative b/c it was the easier way for me. I used the data.table package, hope you are familiar with it.
rm(list = ls())
library(data.table)

### gen dataset with sample weights
dataset <- data.table(iris)
dataset[, weights := rep(c(1, 0.7, 0.3, 4, 5), 30)]
dataset[, Species := NULL]

### initial hclust for estimating weighted centroids
clustering <- hclust(dist(dataset[, c(1:4)], method = 'euclidean'),
                     method = 'ward.D2')
no_of_clusters <- 4

### estimating starting centroids (weighted)
hclust_groups <- cutree(clustering, k = no_of_clusters)
weighted_centroids <- matrix(NA, nrow = no_of_clusters,
                             ncol = ncol(dataset[, c(1:4)]))
for (i in (1:no_of_clusters)) {
  weighted_centroids[i, ] <- sapply(dataset[, c(1:4)][hclust_groups == i, ],
                                    weighted.mean,
                                    w = dataset[hclust_groups == i, weights])
}

### performing weighted k-means as explained in my post
iter <- 0
cluster_i <- 0
cluster_iminus1 <- 1

## while loop: if the number of iterations is smaller than 50 and cluster_i (result of
## the current iteration) is not identical to cluster_iminus1 (result of the former
## iteration), then continue
while (identical(cluster_i, cluster_iminus1) == F && iter < 50) {
  # update iteration
  iter <- iter + 1
  # k-means with weighted centroids and one iteration (may generate warning messages
  # as no convergence is reached)
  cluster_kmeans <- kmeans(x = dataset[, c(1:4)], centers = weighted_centroids,
                           iter.max = 1)$cluster
  # estimating new weighted centroids from the current cluster assignment
  weighted_centroids <- matrix(NA, nrow = no_of_clusters,
                               ncol = ncol(dataset[, c(1:4)]))
  for (i in (1:no_of_clusters)) {
    weighted_centroids[i, ] <- sapply(dataset[, c(1:4)][cluster_kmeans == i, ],
                                      weighted.mean,
                                      w = dataset[cluster_kmeans == i, weights])
  }
  # update cluster_i and cluster_iminus1
  if (iter == 1) {cluster_iminus1 <- 0} else {cluster_iminus1 <- cluster_i}
  cluster_i <- cluster_kmeans
}

## merge final clusters to data table
dataset[, cluster := cluster_i]
If you want to increase the weight of a variable (column), just multiply it by a constant c > 1.
It's trivial to show that this increases the weight in the SSQ optimization objective.
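A minimal sketch of that idea, assuming Sample_Data from the question and purely illustrative weights:
Sample_Data <- scale(Sample_Data)
w <- c(A = 3, B = 1, C = 1)                     # illustrative weights: column A counts three times as much
Weighted_Data <- sweep(Sample_Data, 2, w, `*`)  # multiply each column by its weight
output_kmeans <- kmeans(Weighted_Data, centers = 5, nstart = 50)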

How to delete certain nodes from a regression tree built by `ctree()` from `party` package

I've built a regression tree using ctree() from package party.
The results of my model have many nodes which contain equal probabilities of the dependent variable's classes (e.g. class A = 0.33, class B = 0.33, class C = 0.33). I want to take these nodes out of the model. The package tree has the snip.tree() command, where we can specify the node numbers to be deleted from the model. This command does not recognize regression trees built with ctree(). Please let me know if there is a way to delete certain nodes from a regression tree built using ctree().
I have used the model:
rv.mod1 <- ctree(ldclas ~ L2 + L3 + L4 + L5 + L6 + ele + ndvi + nd_var + nd_ps, data = rv, controls = ctree_control(minsplit = 0, minbucket = 0))
pr.rv.mod1 <- snip.tree(rv.mod1, nodes = nn2.rv.mod1$nodes)
nn2.rv.mod1$nodes is a vector with the nodes to be deleted from the rv.mod1 model. But I get an error:
Error in snip.tree(rv.mod1, nodes = nn2.rv.mod1$nodes) :
not legitimate tree
I don't think there is a direct way to do this, but I will propose a "hack" using the weights argument of ctree.
Let's start with a reproducible example
library(party)
irisct <- ctree(Species ~ .,data = iris)
plot(irisct)
Now, suppose you want to get rid of node number 5. You can do the following
NewWeights <- rep(1, dim(iris)[1]) # a weights vector that will be passed to the `weights` argument of `ctree`
Node <- 5 # selecting node #5
n <- nodes(irisct, Node)[[1]] # retrieving the weights of that node
NewWeights[which(as.logical(n$weights))] <- 0 # setting these weights to zero, so `ctree` will disregard those observations
irisct2 <- ctree(Species ~ ., data = iris, weights = NewWeights) # creating the new tree with the new weights
plot(irisct2)
Note how nodes 2, 6 and 7 (now named 2, 4 and 5, because we have fewer splits) kept exactly the same distributions and splitting conditions.
I didn't test it for all nodes, but it seems to work fairly well.

How do I predict new data's cluster after clustering training data?

I have already trained my clustering model using hclust:
model = hclust(distances, method = "ward")
And the result looks good.
Now I have some new data records, and I want to predict which cluster each of them belongs to. How do I get that done?
Clustering is not supposed to "classify" new data; as the name suggests, that is the core concept of classification.
Some clustering algorithms (such as the centroid-based ones: k-means, k-medians, etc.) can "label" a new instance based on the model created. Unfortunately, hierarchical clustering is not one of them: it does not partition the input space, it just "connects" some of the objects given during clustering, so you cannot assign a new point to this model.
The only "solution" for using hclust in order to "classify" is to create another classifier on top of the labeled data given by hclust. For example, you can now train knn (even with k = 1) on the data with labels from hclust and use it to assign labels to new points.
As already mentioned, you can use a classifier such as class::knn to determine which cluster a new individual belongs to.
The KNN or k-nearest neighbors algorithm is one of the simplest machine learning algorithms and is an example of instance-based learning, where new data are classified based on stored, labeled instances. More specifically, the distance between the stored data and the new instance is calculated by means of some kind of a similarity measure. This similarity measure is typically expressed by a distance measure such as the Euclidean distance.
Below is example code using the iris data.
library(scorecard)
library(factoextra)
library(class)
df_iris <- split_df(iris, ratio = 0.75, seed = 123)
d_iris <- dist(scale(df_iris$train[,-5]))
hc_iris <- hclust(d_iris, method = "ward.D2")
fviz_dend(hc_iris, k = 3, cex = 0.5, k_colors = c("#00AFBB", "#E7B800", "#FC4E07"),
          color_labels_by_k = TRUE, ggtheme = theme_minimal())
groups <- cutree(hc_iris, k = 3)
table(groups)
Predict new data
knnClust <- knn(train = df_iris$train[,-5], test = df_iris$test[,-5] , k = 1, cl = groups)
knnClust
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 2 3 3 3 2 2 2 2 2 3 3 2 2 3 2 2 2 2 2 2 2 2 2
Levels: 1 2 3
# p1 <- fviz_cluster(list(data = df_iris$train[,-5], cluster = groups), stand = F) + xlim(-11.2,-4.8) + ylim(-3,3) + ggtitle("train")
# p2 <- fviz_cluster(list(data = df_iris$test[,-5], cluster = knnClust),stand = F) + xlim(-11.2,-4.8) + ylim(-3,3) + ggtitle("test")
# gridExtra::grid.arrange(p1,p2,nrow = 2)
pca1 <- data.frame(prcomp(df_iris$train[,-5], scale. = T)$x[,1:2], cluster = as.factor(groups), factor = "train")
pca2 <- data.frame(prcomp(df_iris$test[,-5], scale. = T)$x[,1:2], cluster = as.factor(knnClust), factor = "test")
pca <- as.data.frame(rbind(pca1,pca2))
Plot train and test data
ggplot(pca, aes(x = PC1, y = PC2, color = cluster, size = 1, alpha = factor)) +
geom_point(shape = 19) + theme_bw()
You can use this classification and then use LDA to predict which class the new point should fall into.
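A rough sketch of that idea (illustrative names: `train` holds the clustered data, `groups` the labels from cutree(), and `newdata` the new records):
library(MASS)
fit <- lda(x = train, grouping = factor(groups))  # train LDA on the hclust labels
predict(fit, newdata)$class                       # predicted cluster for each new record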
I faced a similar problem and worked out a temporary solution. In my R environment, the hclust function gives labels for the training data, so we can use a supervised learning model to reconnect the labels to the features, and then apply the same data processing we would use for any supervised learning model. For a binary classification setup, we can use the KS value, AUC value and so on to gauge the performance of this clustering.
Similarly, we can apply PCA to the features and extract PC1 as a label; by binning this label, we obtain a new label suited to classification and proceed as we would with a classification model. In R, I find the PCA approach runs much faster than hclust (Mayank 2016).
In practice, I find this approach easy to deploy as a model, but I suspect this temporary solution may introduce bias into the predictions.
Ref: Mayank. 2016. "Hclust() in R on Large Datasets." Stack Overflow.
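A rough sketch of the PC1-binning idea mentioned above (`train` is an illustrative stand-in for the feature matrix):
pc1 <- prcomp(scale(train))$x[, 1]                     # first principal component scores
pseudo_label <- cut(pc1,
                    breaks = quantile(pc1, probs = seq(0, 1, length.out = 4)),
                    include.lowest = TRUE,
                    labels = c("low", "mid", "high"))  # bin PC1 into 3 classes
# pseudo_label can now serve as the response for any supervised classifier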
Why not compute the centroid of the points for each hclust cluster, then assign a new point to the nearest centroid using the same distance function?
knn in class will only look at the nearest neighbours and only allows Euclidean distance.
There's no need to run a classifier.
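A minimal sketch of the centroid idea (illustrative names: `train` is the clustered data, `groups` the labels from cutree(), and `newdata` has the same columns as `train`; Euclidean distance is used here, so swap in whatever distance you built the hclust model with):
centroids <- t(sapply(split(as.data.frame(train), groups), colMeans))  # one row per cluster
nearest_centroid <- function(x, centroids) {
  which.min(colSums((t(centroids) - x)^2))  # index of the closest centroid
}
new_labels <- apply(newdata, 1, nearest_centroid, centroids = centroids)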

How to draw the regression tree correctly when clustering using R

I got stuck when trying to build a model.
I want to divide the dataset freeny into 10 classes by year.
library(RSNNS)  # provides decodeClassLabels() and splitForTrainingAndTest()
data(freeny)
options(digits = 2)
year <- as.integer(rownames(freeny))
freeny <- cbind(freeny, year)
freeny <- freeny[sample(1:nrow(freeny), length(1:nrow(freeny))), 1:ncol(freeny)]  # shuffle the rows
freenyValues <- freeny[, 1:5]
freenyTargets <- decodeClassLabels(freeny[, 6])
freeny <- splitForTrainingAndTest(freenyValues, freenyTargets, ratio = 0.15)
km <- kmeans(freeny$inputsTrain, 10, iter.max = 100, nstart = 5)
kclust <- km$cluster
library(tree)
kclust <- as.factor(kclust)
mdp <- cbind(freeny$inputsTrain, kclust)
mdp <- data.frame(mdp)
mdp.tr <- tree(kclust ~ ., mdp)
but the result is that the tree only has 5 terminal nodes. It should have 10 terminal nodes, because I divided the data into 10 clusters with kmeans. What's wrong?
No, it shouldn't. tree is an algorithm that tries to fit a tree given predictors and a response, and stops if
the terminal nodes are too small or too few to be split
(manual page). Try adjusting the minsize parameter (see ?tree.control).
minsize: The smallest allowed node size: a weighted quantity. The
default is 10.
I think the following will do what is intended:
mdp.tr <- tree(kclust ~ ., mdp, minsize = 1)
