Comparing graph networks in qgraph() in R

I've got two network graphs, and I'd like to be able to visually compare them.
Is there a way of overlapping them/displaying the difference in associations in qgraph()?
Thanks!
library("qgraph")
data("big5")
big5_sub1 <- big5[1:250,1:10]
big5_sub2 <- big5[251:500,1:10]
qgraph(cor(big5_sub1))
qgraph(cor(big5_sub2))

To visualize the difference you could literally subtract the two correlation matrices and plot the result. Each edge then shows how much the association between two nodes differs between the samples. I'm not sure it's statistically sound to do that, but purely as a visualization this is how I'd do it:
library("qgraph")
data("big5")
big5_sub1 <- big5[1:250,1:10]
big5_sub2 <- big5[251:500,1:10]
cor1 <- cor(big5_sub1)
cor2 <- cor(big5_sub2)
cor.diff <- cor1-cor2
qgraph(cor.diff)
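If instead you want to compare the networks side by side, a common trick is to plot both with identical node placement so differences in edges stand out. A minimal sketch, assuming qgraph's averageLayout(), which computes a shared layout from several weights matrices:
L <- averageLayout(cor1, cor2)   # shared node placement for both graphs
layout(t(1:2))                   # two plots side by side
qgraph(cor1, layout = L, title = "First half")
qgraph(cor2, layout = L, title = "Second half")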


DBSCAN Clustering returning single cluster with noise points

I am trying to perform DBSCAN clustering on the data https://www.kaggle.com/arjunbhasin2013/ccdata. I have cleaned the data and applied the algorithm.
data1 <- read.csv('C:\\Users\\write\\Documents\\R\\data\\Project\\Clustering\\CC GENERAL.csv')
head(data1)
data1 <- data1[,2:18]
dim(data1)
colnames(data1)
head(data1,2)
#to check if the data is empty
library(purrr)
is_empty(data1)
#to check if data has duplicates
library(dplyr)
any(duplicated(data1))
#to check if data has NA values
any(is.na(data1))
data1 <- na.omit(data1)
any(is.na(data1))
dim(data1)
The algorithm was applied as follows.
#DBSCAN
data1 <- scale(data1)
library(fpc)
library(dbscan)
set.seed(500)
#to find optimal eps
kNNdistplot(data1, k = 34)
abline(h = 4, lty = 3)
The figure shows the 'knee' used to identify the eps value. Since there are 17 attributes to be considered for clustering, I have taken k = 17*2 = 34.
db <- dbscan(data1,eps = 4,minPts = 34)
db
The result I obtained is "The clustering contains 1 cluster(s) and 147 noise points."
No matter what values I try for eps and minPts, the result is the same.
Can anyone tell me where I have gone wrong?
Thanks in advance.
You have two options:
Increase the search radius around core points (the eps parameter).
Decrease the minimum number of points (minPts) required to define a core point.
I would start by decreasing minPts, since it seems very high: if DBSCAN cannot find at least minPts points within eps of a point, it will not grow a cluster around it.
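To see how sensitive the result is, you could sweep a few combinations and count the clusters each returns; a rough sketch (the parameter values are illustrative, not tuned):
library(dbscan)
for (eps in c(2, 3, 4)) {
  for (minPts in c(10, 17, 34)) {
    db <- dbscan(data1, eps = eps, minPts = minPts)
    # cluster 0 is noise, so max(db$cluster) is the number of clusters found
    cat("eps =", eps, "minPts =", minPts, "clusters =", max(db$cluster), "\n")
  }
}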
A typical problem with DBSCAN (and clustering in general) is that real data typically does not fall into nice clusters but forms one connected point cloud. In this case, DBSCAN will always find only a single cluster. You can check this with several methods. The most direct method is a pairs plot (a scatterplot matrix):
plot(as.data.frame(data1))
Since you have many variables, the scatterplot panels are very small, but you can see that the points are very close together in almost all panels. DBSCAN will connect all points in these dense areas into a single cluster, whereas k-means will simply partition the dense area.
Another option is to check for clusterability with methods like VAT or iVAT (https://link.springer.com/chapter/10.1007/978-3-642-13657-3_5).
library("seriation")
## calculate distances for a small sample
d <- dist(data1[sample(seq(nrow(data1)), size = 1000), ])
iVAT(d)
You will see that the plot shows no block structure around the diagonal, indicating that clustering will not find much.
To improve clustering, you need to work on the data. You can remove irrelevant variables, and you may have very skewed variables that should be transformed first. You could also try a non-linear embedding before clustering.
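As a sketch of the kind of preprocessing meant here, assuming the monetary columns in this data are non-negative and right-skewed (the log1p transform and the cutoff line are illustrative, not tuned):
library(dbscan)
raw <- na.omit(read.csv('CC GENERAL.csv')[, 2:18])  # path as in the question
logged <- log1p(raw)          # compress the long right tails
data2 <- scale(logged)
kNNdistplot(data2, k = 17)    # re-read the knee from this new plot
abline(h = 1, lty = 3)        # h = 1 is a placeholder, not a tuned value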

PCA: Can I reverse the axis of the first principal component in R?

Here is a reproducible example:
library(ggfortify)
set.seed(10)
pick <- sample(nrow(iris),nrow(iris)/2)
iris.training <- iris[pick,]
iris.testing <- iris[-pick,]
pca.training <- prcomp(iris.training[-5])
pca.testing <- prcomp(iris.testing[-5])
autoplot(pca.training,loadings.label=T,loadings=T)
autoplot(pca.testing,loadings.label=T,loadings=T)
which produces the following output:
As one can see, PCA on iris.training and iris.testing produces very similar biplots, but the first principal component has flipped its sign: the plots are mirrored. Is it possible to force a 180-degree rotation of the component so the two plots match?
You are not returning the rotated variables. The changed code is below; notice retx=TRUE.
set.seed(10)
pick <- sample(nrow(iris),nrow(iris)/2)
iris.training <- iris[pick,]
iris.testing <- iris[-pick,]
pca.training <- prcomp(iris.training[-5], retx=TRUE)
pca.testing <- prcomp(iris.testing[-5], retx=TRUE)
autoplot(pca.training,loadings.label=TRUE,loadings=TRUE)
autoplot(pca.testing,loadings.label=TRUE,loadings=TRUE)
It produced the following outputs for training and testing.
I'm assuming autoplot is the function from the ggfortify package. There are probably two ways to do this. The easiest is to just ask to reverse the x axis, by writing
autoplot(pca.testing,loadings.label=TRUE,loadings=TRUE) + scale_x_reverse()
Notice that this didn't change any values: the X axis now runs from positive to negative instead of the usual direction.
The second is to modify the pca.testing object to swap the signs on the x axis.
This is statistically valid: PCA doesn't determine the signs of its components. But it's a bit tricky, because the signs show up in two places: the x component (the data points) and the rotation component (the arrows):
pca.testing$x[,1] <- -pca.testing$x[,1]
pca.testing$rotation[,1] <- -pca.testing$rotation[,1]
autoplot(pca.testing,loadings.label=TRUE,loadings=TRUE)
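Instead of flipping signs by hand, you could wrap the idea in a small helper that aligns every component of one prcomp fit with a reference fit by checking the sign of the dot product of their loadings. This is a sketch replacing the manual flips above; align_pca_signs is a made-up name, not a function from any package:
align_pca_signs <- function(pca_ref, pca_new) {
  for (j in seq_len(ncol(pca_new$rotation))) {
    # a negative dot product means the loadings point in opposite directions
    if (sum(pca_ref$rotation[, j] * pca_new$rotation[, j]) < 0) {
      pca_new$rotation[, j] <- -pca_new$rotation[, j]
      pca_new$x[, j] <- -pca_new$x[, j]
    }
  }
  pca_new
}
pca.testing <- align_pca_signs(pca.training, pca.testing)
autoplot(pca.testing, loadings.label=TRUE, loadings=TRUE)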
Not related to your question, but some advice: don't use T, use TRUE; otherwise the next time you have temperature data you may inadvertently overwrite T and cause havoc with your analysis.

In R, find non-linear curves from two sets of points and then find the intersection of those curves

Using R, I want to estimate two curves using points from two vectors, and then find the x and y coordinates where those estimated curves intersect.
In a strategic setting with players "t" and "p", I am simulating best responses for both players to the other player's possible actions (game theory). The problem is that I don't have functions or lines; I have two sets of simulated points, each representing one player's best response to given actions by the other player. The actual math was too difficult for me (or MATLAB) to solve, which is why I'm using this simulated, visual approach. I want to estimate the best response functions (i.e. fit non-linear curves through the points), and then find where the two estimated curves intersect, since the intersection of the best response curves identifies a Nash equilibrium.
As an example, here are two such vectors I am working with:
t=c(10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.1,10.1,10.1,10.1,10.1,10.1,10.1,10.1,10.1,10.1,10.1,10.1,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0)
p=c(12.3,12.3,12.3,12.3,12.3,12.3,12.4,12.4,12.4,12.5,12.5,12.5,12.6,12.6,12.7,12.7,12.8,12.8,12.9,12.9,13.0,13.1,13.1,13.2,13.3,13.4,13.5,13.4,13.5,13.6,13.6,13.7,13.8,13.8,13.9,13.9,13.9,14.0,14.0,14.0,14.0)
For the first line, the sample is made up of (t,a), and for the second line, the sample is made up of (a,p) where a is a third vector given by
a = seq(10, 14, by = 0.1)
For example, the first point for the sample corresponding to the first vector would be (10.0,10.0) and the second point would be (10.0,10.1). The first point for the sample corresponding to the second vector would be (10.0,12.3) and the second point would be (10.1,12.3).
What I originally tried was to estimate the curves using polynomials fit with lm, but that doesn't always seem to work:
plot(a,t, xlim=c(10,14), ylim=c(10,14), col="purple")
points(p,a, col="red")
fit4p <- lm(a~poly(p,3,raw=TRUE))
fit4t <- lm(t~poly(a,3,raw=TRUE))
lines(a, predict(fit4t, data.frame(x=a)), col="purple", xlim=c(10,14), ylim=c(10,14),type="l",xlab="p",ylab="t")
lines(p, predict(fit4p, data.frame(x=a)), col="green")
fit4pCurve <- function(x) coef(fit4p)[1] +x*coef(fit4p)[2]+x^2*coef(fit4p)[3]+x^3*coef(fit4p)[4]
fit4tCurve <- function(x) coef(fit4t)[1] +x*coef(fit4t)[2]+x^2*coef(fit4t)[3]+x^3*coef(fit4t)[4]
a_opt1 = optimise(f=function(x) abs(fit4pCurve(x)-fit4tCurve(x)), c(10,14))$minimum
b_opt1 = as.numeric(fit4pCurve(a_opt1))
EDIT:
After fixing the typo, I get the correct answer, but it doesn't always work if the samples don't come back as cleanly.
So my question can be broken down a few ways. First, is there a better way to accomplish what I'm trying to do? I know what I'm doing isn't perfectly accurate by any means, but it seems like a decent approximation for my purposes. Second, if there isn't a better way, is there a way to improve on the methodology listed above?
Restart your R session, make sure all variables are cleared, and copy/paste this code. I found a few mistakes in the referenced variables. Also note that R is case sensitive. My suspicion is that you've been overwriting variables.
plot(a,t, xlim=c(10,14), ylim=c(10,14), col="purple")
points(p,a, col="red")
fit4p <- lm(a~poly(p,3,raw=TRUE))
fit4t <- lm(t~poly(a,3,raw=TRUE))
lines(a, predict(fit4t, data.frame(a=a)), col="purple")
lines(p, predict(fit4p, data.frame(p=p)), col="green")
fit4pCurve <- function(x) coef(fit4p)[1] + x*coef(fit4p)[2] + x^2*coef(fit4p)[3] + x^3*coef(fit4p)[4]
fit4tCurve <- function(x) coef(fit4t)[1] + x*coef(fit4t)[2] + x^2*coef(fit4t)[3] + x^3*coef(fit4t)[4]
a_opt = optimise(f=function(x) abs(fit4pCurve(x)-fit4tCurve(x)), c(10,14))$minimum
b_opt = as.numeric(fit4pCurve(a_opt))
As you will see:
> a_opt
[1] 12.24213
> b_opt
[1] 10.03581
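If the polynomial fits prove unstable, a more robust sketch is to interpolate each set of points with a smoothing spline (which tolerates tied x values) and locate the crossing with uniroot(). Note that uniroot() needs a bracketing interval over which the difference changes sign; c(12, 13) below is read off the plot rather than derived:
fit_t <- smooth.spline(a, t)   # best response t as a function of a
fit_p <- smooth.spline(p, a)   # best response a as a function of p
f_t <- function(x) predict(fit_t, x)$y
f_p <- function(x) predict(fit_p, x)$y
cross <- uniroot(function(x) f_p(x) - f_t(x), interval = c(12, 13))
c(x = cross$root, y = f_p(cross$root))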

R: Convert correlation matrix to edge list

I want to create a network graph of my data, where the weight of the edges is defined by the correlation coefficients in a correlation matrix, and whether an edge is present is defined by whether the correlation is statistically significant.
Since I want to play around with some parameters, I need this information as an edge list rather than in matrix form, but I'm struggling with how to convert it. I have tried to use igraph as shown below, but I cannot figure out how to get the information about which correlations are significant into the edge list. I guess the weight could be set to zero to encode that, but how do I combine a correlation matrix and a p-value matrix?
library(igraph)
g <- graph.adjacency(a,weighted=TRUE)
df <- get.data.frame(g)
df
It'd be great if you could provide a minimal reproducible example, but I think I understand what you're asking for. You'll need to make a graph from the matrix using graph_from_adjacency_matrix, but make sure to pass something in the weighted parameter, because otherwise the elements of the matrix are interpreted as the number of edges (and values less than 1 mean no edge). Then you can create an edge list from the graph using as_data_frame, perform whatever calculation you want or join any external data you have, and convert it back to a graph using graph_from_data_frame (diag=FALSE below drops the self-correlations of 1 on the diagonal):
library(igraph)
cor_mat <- cor(mtcars)
cor_g <- graph_from_adjacency_matrix(cor_mat, mode='undirected', weighted='correlation', diag=FALSE)
cor_edge_list <- as_data_frame(cor_g, 'edges')
only_sig <- cor_edge_list[abs(cor_edge_list$correlation) > .75, ]
new_g <- graph_from_data_frame(only_sig, directed=FALSE)
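To bring p-values into the edge list, one option is to compute the correlation and p-value matrices together and melt both upper triangles into one data frame, then filter on significance. A sketch, assuming the Hmisc package's rcorr() (any function returning matching r and p matrices would do):
library(igraph)
library(Hmisc)
rc <- rcorr(as.matrix(mtcars))   # rc$r = correlations, rc$P = p-values
ut <- upper.tri(rc$r)            # one entry per variable pair
edges <- data.frame(
  from        = rownames(rc$r)[row(rc$r)[ut]],
  to          = colnames(rc$r)[col(rc$r)[ut]],
  correlation = rc$r[ut],
  p           = rc$P[ut]
)
only_sig <- edges[edges$p < 0.05, ]
new_g <- graph_from_data_frame(only_sig, directed = FALSE)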
For anyone who still needs this, here is the answer:
library(igraph)
g <- graph.adjacency(a, mode="upper", weighted=TRUE, diag=FALSE)  # a is the correlation matrix
e <- get.edgelist(g)
df <- data.frame(e, weight = E(g)$weight)  # cbind() would coerce the weights to character

Revealing clusters of interaction in igraph

I have an interaction network, and I used the following code to make an adjacency matrix, calculate the dissimilarity between the nodes of the network, and then cluster them to form modules:
ADJ1 <- abs(adjacent_mat)^6
dissADJ1 <- 1 - ADJ1
hierADJ <- hclust(as.dist(dissADJ1), method = "average")
Now I would like those modules to appear when I plot the igraph.
g <- simplify(graph_from_adjacency_matrix(adjacent_mat, weighted=T))
plot.igraph(g)
However, the only thing that I have found thus far to translate hclust output to a graph is the following tutorial: http://gastonsanchez.com/resources/2014/07/05/Pretty-tree-graph/
phylo_tree = as.phylo(hierADJ)
graph_edges = phylo_tree$edge
graph_net = graph.edgelist(graph_edges)
plot(graph_net)
which is useful for hierarchical lineage, but I just want the nodes that closely interact to form visible clusters in the network plot.
Can anyone recommend how to use a command such as components from igraph to get these clusters to show?
igraph provides a bunch of different layout algorithms which are used to place nodes in the plot.
A good one to start with for a weighted network like this is the force-directed layout (implemented by layout.fruchterman.reingold in igraph).
Below is an example of the force-directed layout using some simple simulated data.
First, we create some mock data and clusters, along with some "noise" to make it more realistic:
library('dplyr')
library('igraph')
library('RColorBrewer')
set.seed(1)
# generate a couple clusters
nodes_per_cluster <- 30
n <- 10
nvals <- nodes_per_cluster * n
# cluster 1 (increasing)
cluster1 <- matrix(rep((1:n)/4, nodes_per_cluster) + rnorm(nvals, sd=1),
                   nrow=nodes_per_cluster, byrow=TRUE)
# cluster 2 (decreasing)
cluster2 <- matrix(rep((n:1)/4, nodes_per_cluster) + rnorm(nvals, sd=1),
                   nrow=nodes_per_cluster, byrow=TRUE)
# noise cluster
noise <- matrix(sample(1:2, nvals, replace=TRUE) + rnorm(nvals, sd=1.5),
                nrow=nodes_per_cluster, byrow=TRUE)
dat <- rbind(cluster1, cluster2, noise)
colnames(dat) <- paste0('n', 1:n)
rownames(dat) <- c(paste0('cluster1_', 1:nodes_per_cluster),
                   paste0('cluster2_', 1:nodes_per_cluster),
                   paste0('noise_', 1:nodes_per_cluster))
Next, we can use Pearson correlation to construct our adjacency matrix:
# create correlation matrix
cor_mat <- cor(t(dat))
# shift to [0,1] to separate positive and negative correlations
adj_mat <- (cor_mat + 1) / 2
# get rid of low correlations and self-loops
adj_mat <- adj_mat^3
adj_mat[adj_mat < 0.5] <- 0
diag(adj_mat) <- 0
Cluster the data using hclust and cutree:
# convert to dissimilarity matrix and cluster using hclust
dissim_mat <- 1 - adj_mat
dend <- dissim_mat %>%
  as.dist %>%
  hclust
clusters <- cutree(dend, h=0.65)
# color the nodes
pal <- colorRampPalette(brewer.pal(11, "Spectral"))(length(unique(clusters)))
node_colors <- pal[clusters]
Finally, create an igraph graph from the adjacency matrix and plot it using the fruchterman.reingold layout:
# create graph
g <- graph.adjacency(adj_mat, mode='undirected', weighted=TRUE)
# set node color and plot using a force-directed layout (fruchterman-reingold)
V(g)$color <- node_colors
coords_fr = layout.fruchterman.reingold(g, weights=E(g)$weight)
# igraph plot options
igraph.options(vertex.size=8, edge.width=0.75)
# plot network
plot(g, layout=coords_fr, vertex.color=V(g)$color)
In the above code, I generated two "clusters" of correlated rows, and a third group of "noise".
Hierarchical clustering (hclust + cutree) is used to assign the data points to clusters, and the nodes are colored based on cluster membership.
The resulting plot shows the two simulated clusters as separate groups, with the noise nodes scattered between them.
For some more examples of clustering and plotting graphs with igraph, check out: http://michael.hahsler.net/SMU/LearnROnYourOwn/code/igraph.html
You haven't shared toy data for us to play with and suggest code improvements, but your question states that you are only interested in plotting your clusters distinctly, that is, in the graphical presentation.
Although igraph comes with some nice force-directed layout algorithms, such as layout.fruchterman.reingold and layout_with_kk, with a large number of nodes they can quickly become difficult to interpret or make sense of at all.
With these traditional methods of visualising networks:
the layout algorithm, rather than the data, determines the visualisation
similar networks may end up being visualised very differently
a large number of nodes makes the visualisation difficult to interpret
Instead, I find Hive Plots to be better at displaying important network properties, which, in your instance, are the cluster and the edges.
In your case, you can:
Plot each cluster on a different straight line
Order the placement of nodes intelligently, so that nodes with certain properties sit at the very start or end of each line
Colour the edges to identify their direction
To achieve this you will need to:
use the ggnetwork package to turn your igraph object into a dataframe
map your clusters to the nodes present in this dataframe
generate coordinates for the straight lines and map these to each cluster
use ggplot to visualise (a rough sketch follows below)
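A rough sketch of that recipe, reusing the graph g and the clusters vector from the first answer above: each cluster gets its own vertical axis, and nodes are ordered along it by degree. This is a plain-ggplot2 stand-in for a hive plot, not the HiveR algorithm, and all aesthetics are illustrative:
library(igraph)
library(ggplot2)
edge_df <- igraph::as_data_frame(g, "edges")
node_df <- data.frame(name = V(g)$name,
                      cluster = clusters[V(g)$name],
                      degree = degree(g))
node_df$x <- node_df$cluster                                   # one axis per cluster
node_df$y <- ave(node_df$degree, node_df$cluster, FUN = rank)  # order nodes by degree
i <- match(edge_df$from, node_df$name)
j <- match(edge_df$to, node_df$name)
edge_df[, c("x", "y", "xend", "yend")] <-
  cbind(node_df$x[i], node_df$y[i], node_df$x[j], node_df$y[j])
ggplot() +
  geom_curve(data = edge_df, aes(x, y, xend = xend, yend = yend),
             alpha = 0.2, curvature = 0.3) +
  geom_point(data = node_df, aes(x, y, colour = factor(cluster)), size = 2) +
  theme_void()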
There is also the HiveR package in R, should you wish to use a packaged solution. You might also find another visualisation technique for graphs very useful: BioFabric.
