Plot a tree-graph with ggplot - r

I have hierarchical data in this form:
df <- data.frame(root=rep("unclustered",22),
itr1=paste0("1.",c(1,5,1,2,4,1,3,2,5,5,6,9,4,3,4,8,5,7,3,2,10,8)),
itr2=paste0("2.",c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,10,17,18,19,20,21)),
itr3=paste0("3.",c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22)),stringsAsFactors = F)
Which describes iterative cluster assignment of data points - the rows. The first column, root, assigns all points to the root cluster, then each column is an iteration of the clustering which takes the clusters of the previous iteration and breaks them down to further sub-clusters.
I'm trying to plot this process using a tree network.
I know that using data.tree, I can simply do:
df$pathString <- do.call(paste,c(df,sep="/"))
df.graph <- data.tree::as.Node(df)
plot(df.graph)
But I'm looking for something a bit fancier, preferably with a ggplot look.
So I converted df.graph to an igraph object:
df.igraph <- data.tree::as.igraph.Node(df.graph)
And tried to use ggraph:
library(ggraph)
ggraph(df.igraph, 'igraph', algorithm = 'tree') +
geom_edge_link() +
ggforce::theme_no_axes()
Any idea how to get the ggraph option to include the nodes with their labels, add arrows to the edges, and perhaps color each level differently?

This seems to get me close:
V(df.igraph)$class <- names(V(df.igraph))
ggraph(df.igraph,layout='tree')+
geom_edge_link(arrow=arrow(length=unit(2,'mm')),end_cap=circle(3,'mm'))+
geom_node_label(aes(label=class))+
theme_void()

Related

iGraph - Spacing between verticies

I have a dataset called data. The data is not that important, but every interaction has a name. I want to create a graph in iGraph with the following code:
tab <- count(data, B, S, K)
factors <- table(interaction(tab$B, tab$K),interaction(tab$S,tab$K))
graph1 <- graph_from_incidence_matrix(factors)
plot(graph1, vertex.size = 40, layout = layout.bipartite)
However, I get the following:
All the names of interactions are completely mixed together. I can make it a little more readable by lowering the vertex.size, but I want to find a solution to my problem.
I want to create more space between the verticies, but I cannot seem to find the right way.
I have tried creating a manual graph by using tkplot, but it is annoying that I manually have to sort them each time.
Best regards

Error in axis(side = side, at = at, labels = labels, ...) : invalid value specified for graphical parameter "pch"

I have applied DBSCAN algorithm on built-in dataset iris in R. But I am getting error when tried to visualise the output using the plot( ).
Following is my code.
library(fpc)
library(dbscan)
data("iris")
head(iris,2)
data1 <- iris[,1:4]
head(data1,2)
set.seed(220)
db <- dbscan(data1,eps = 0.45,minPts = 5)
table(db$cluster,iris$Species)
plot(db,data1,main = 'DBSCAN')
Error: Error in axis(side = side, at = at, labels = labels, ...) :
invalid value specified for graphical parameter "pch"
How to rectify this error?
I have a suggestion below, but first I see two issues:
You're loading two packages, fpc and dbscan, both of which have different functions named dbscan(). This could create tricky bugs later (e.g. if you change the order in which you load the packages, different functions will be run).
It's not clear what you're trying to plot, either what the x- or y-axes should be or the type of plot. The function plot() generally takes a vector of values for the x-axis and another for the y-axis (although not always, consult ?plot), but here you're passing it a data.frame and a dbscan object, and it doesn't know how to handle it.
Here's one way of approaching it, using ggplot() to make a scatterplot, and dplyr for some convenience functions:
# load our packages
# note: only loading dbscacn, not loading fpc since we're not using it
library(dbscan)
library(ggplot2)
library(dplyr)
# run dbscan::dbscan() on the first four columns of iris
db <- dbscan::dbscan(iris[,1:4],eps = 0.45,minPts = 5)
# create a new data frame by binding the derived clusters to the original data
# this keeps our input and output in the same dataframe for ease of reference
data2 <- bind_cols(iris, cluster = factor(db$cluster))
# make a table to confirm it gives the same results as the original code
table(data2$cluster, data2$Species)
# using ggplot, make a point plot with "jitter" so each point is visible
# x-axis is species, y-axis is cluster, also coloured according to cluster
ggplot(data2) +
geom_point(mapping = aes(x=Species, y = cluster, colour = cluster),
position = "jitter") +
labs(title = "DBSCAN")
Here's the image it generates:
If you're looking for something else, please be more specific about what the final plot should look like.

Set common y axis limits from a list of ggplots

I am running a function that returns a custom ggplot from an input data (it is in fact a plot with several layers on it). I run the function over several different input data and obtain a list of ggplots.
I want to create a grid with these plots to compare them but they all have different y axes.
I guess what I have to do is extract the maximum and minimum y axes limits from the ggplot list and apply those to each plot in the list.
How can I do that? I guess its through the use of ggbuild. Something like this:
test = ggplot_build(plot_list[[1]])
> test$layout$panel_scales_x
[[1]]
<ScaleContinuousPosition>
Range:
Limits: 0 -- 1
I am not familiar with the structure of a ggplot_build and maybe this one in particular is not a standard one as it comes from a "custom" ggplot.
For reference, these plots are created whit the gseaplot2 function from the enrichplot package.
I dont know how to "upload" an R object but if that would help, let me know how to do it.
Thanks!
edit after comments (thanks for your suggestions!)
Here is an example of the a gseaplot2 plot. GSEA stands for Gene Set Enrichment Analysis, it is a technique used in genomic studies. The gseaplot2 function calculates a running average and then plots it and another bar plot on the bottom.
and here is the grid I create to compare the plots generated from different data:
I would like to have a common scale for the "Running Enrichment Score" part.
I guess I could try to recreate the gseaplot2 function and input all of the datasets and then create the grid by facet_wrap, but I was wondering if there was an easy way of extracting parameters from a plot list.
As a reproducible example (from the enrichplot package):
library(clusterProfiler)
data(geneList, package="DOSE")
gene <- names(geneList)[abs(geneList) > 2]
wpgmtfile <- system.file("extdata/wikipathways-20180810-gmt-Homo_sapiens.gmt", package="clusterProfiler")
wp2gene <- read.gmt(wpgmtfile)
wp2gene <- wp2gene %>% tidyr::separate(term, c("name","version","wpid","org"), "%")
wpid2gene <- wp2gene %>% dplyr::select(wpid, gene) #TERM2GENE
wpid2name <- wp2gene %>% dplyr::select(wpid, name) #TERM2NAME
ewp2 <- GSEA(geneList, TERM2GENE = wpid2gene, TERM2NAME = wpid2name, verbose=FALSE)
gseaplot2(ewp2, geneSetID=1, subplots=1:2)
And this is how I generate the plot list (probably there is a much more elegant way):
plot_list = list()
for(i in 1:3) {
fig_i = gseaplot2(ewp2,
geneSetID=i,
subplots=1:2)
plot_list[[i]] = fig_i
}
ggarrange(plotlist=plot_list)

plot a subset of igraph data , just subset by their attribute

I have a fblog data set ,PolParty is one attribute of my data, I want to plot just 2 political parties (say P1 and P2) and plot the network of blogs.
i wrote code below but i think it is wrong , can some one help me?
library(statnet)
library(igraph)
library(sand)
data(fblog)
fblog = upgrade_graph(fblog)
class(fblog)
summary(fblog)
V(fblog)$PolParty
table(V(fblog)$PolParty)
p1<-V(fblog)[PolParty=="PS"] # by their labels/names
p2<-V(fblog)[PolParty=="UDF"]
class(p1)
The objects p1 and p2 that you were creating are of the class igraph.vs (instead of igraph). This object just documents the vertices. Which is not a full graph. Hence when you try to plot it you don't get anything.
Based on the following post: Subset igraph graph by label
g=subgraph.edges(graph=fblog, eids=which(V(fblog)$PolParty==" PS"), delete.vertices = TRUE)
plot(g)
The above works. NOTE: regarding the output of V(fblog)$PolParty- everything is preceeded by a space hence you need to use V(fblog)$PolParty==" PS"
UPDATE: if I want to subset based on 2 conditions- I will modify the which() command:
g=subgraph.edges(graph=fblog, eids=which(V(fblog)$PolParty==" PS"| V(fblog)$PolParty==" UDF"), delete.vertices = TRUE)
plot(g)

Add to ggplot with element of different length

I'm new to ggplot2 and I'm trying to figure out how I can add a line to an already existing plot I created. The original plot, which is the cumulative distribution of a column of data T1 from a data frame x, has about 100,000 elements in it. I have successfully plotted this using ggplot2 and stat_ecdf() with the code I posted below. Now I want to add another line using a set of (x,y) coordinates, but when I try this using geom_line() I get the error message:
Error in data.frame(x = c(0, 7.85398574631245e-07, 3.14159923334398e-06, :
arguments imply differing number of rows: 1001, 100000
Here's the code I'm trying to use:
> set.seed(42)
> x <- data.frame(T1=rchisq(100000,1))
> ps <- seq(0,1,.001)
> ts <- .5*qchisq(ps,1) #50:50 mixture of chi-square (df=1) and 0
> p <- ggplot(x,aes(T1)) + stat_ecdf() + geom_line(aes(ts,ps))
That's what produces the error from above. Now here's the code using base graphics that I used to use but that I am now trying to move away from:
plot(ecdf(x$T1),xlab="T1",ylab="Cum. Prob.",xlim=c(0,4),ylim=c(0,1),main="Empirical vs. Theoretical Distribution of T1")
lines(ts,ps)
I've seen some other posts about adding lines in general, but what I haven't seen is how to add a line when the two originating vectors are not of the same length. (Note: I don't want to just use 100,000 (x,y) coordinates.)
As a bonus, is there an easy way, similar to using abline, to add a drop line on a ggplot2 graph?
Any advice would be much appreciated.
ggplot deals with data.frames, you need to make ts and ps a data.frame then specify this extra data.frame in your call to geom_line:
set.seed(42)
x <- data.frame(T1=rchisq(100000,1))
ps <- seq(0,1,.001)
ts <- .5*qchisq(ps,1) #50:50 mixture of chi-square (df=1) and 0
tpdf <- data.frame(ts=ts,ps=ps)
p <- ggplot(x,aes(T1)) + stat_ecdf() + geom_line(data=tpdf, aes(ts,ps))

Resources