Using tidygraph to derive nodes and graph-level metrics per group - r

I am trying out tidygraph on networks coming from different experimental treatments and am mostly interested in extracting graph-wide metrics and potentially also node-level metrics. I can't seem to get my head around the way tidygraph works.
I am using R v3.5 and tidygraph v1.1.
My data are structured this way:
library(dplyr)
library(tidygraph)
dat <- data.frame(Treatment = rep(c("A", "B"), each = 2),
                  from = c("sp1", "sp2", "sp1", "sp2"),
                  to = c("sp2", "sp3", "sp3", "sp3"),
                  weight = runif(4))
If I want to get per treatment a graph-wide metric like the diameter I would be tempted to do:
dat %>%
  as_tbl_graph() %>%
  activate(edges) %>%
  group_by(Treatment) %>%
  mutate(Diameter = graph_diameter(weights = weight))
But I am unsure about the result, as the diameter is then given for each edge while I would expect one value per treatment (per graph).
Similarly, deriving node-level metrics such as the connectivity of each node for each treatment is not straightforward, since the treatment variable is dropped from the nodes table. I have been trying various hacks, like pasting the treatment IDs onto the from and to columns before calling as_tbl_graph(), along these lines:
dat %>%
  mutate(from = paste(from, Treatment, sep = "_"),
         to = paste(to, Treatment, sep = "_")) %>%
  as_tbl_graph() %>%
  mutate(Treatment = substr(name, 5, 5), name = substr(name, 1, 3)) %>%
  group_by(Treatment) %>%
  mutate(Centrality = centrality_betweenness())
But I got errors that the resulting vectors were of the wrong size (6 instead of 3 or 1).
Is there a way with tidygraph to derive group-level graph-wide and node-level metrics?

I think this is an interesting problem. I have tried several times and still haven't found the best solution for it, but I suspect that it needs the morph() function in order to separate the graphs properly. I haven't explored that much yet. Here is a simple solution for the diameter problem using morph(). Hope it helps.
dat %>%
  as_tbl_graph() %>%
  activate(edges) %>%
  morph(to_split, split_by = "edges") %>%
  filter(Treatment == "A") %>%
  mutate(Diameter = graph_diameter(weights = weight)) %>%
  unmorph() %>%
  activate(edges) %>%
  morph(to_split, split_by = "edges") %>%
  filter(Treatment == "B") %>%
  mutate(Diameter = graph_diameter(weights = weight)) %>%
  unmorph()
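As a complementary sketch that does not rely on morph() at all, the edge list can be split by treatment and each piece turned into its own graph with igraph (the library tidygraph builds on). This swaps tidygraph for plain igraph, so it is an alternative rather than the tidygraph way of doing it. The names `graphs`, `diameters` and `node_metrics` and the fixed seed are my own choices for illustration.

```r
library(igraph)

set.seed(42)  # assumption: fixed seed so the random weights are reproducible
dat <- data.frame(Treatment = rep(c("A", "B"), each = 2),
                  from = c("sp1", "sp2", "sp1", "sp2"),
                  to = c("sp2", "sp3", "sp3", "sp3"),
                  weight = runif(4))

# One directed graph per treatment. The first two columns passed to
# graph_from_data_frame() are read as the edge list, the rest as edge attributes.
graphs <- lapply(split(dat, dat$Treatment), function(d) {
  graph_from_data_frame(d[, c("from", "to", "weight")], directed = TRUE)
})

# Graph-wide metric: one weighted diameter per treatment.
diameters <- sapply(graphs, function(g) diameter(g, weights = E(g)$weight))

# Node-level metric: betweenness of each node within its own treatment graph.
node_metrics <- do.call(rbind, lapply(names(graphs), function(tr) {
  g <- graphs[[tr]]
  data.frame(Treatment = tr,
             name = V(g)$name,
             betweenness = unname(betweenness(g)),
             stringsAsFactors = FALSE)
}))
```

`diameters` is a named vector with one value per treatment, and `node_metrics` has one row per node per treatment, which is the shape the question asks for.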

Related

e_facet using grouped data in echarts4r question

I really like the possibilities this package offers and would like to use it in a Shiny app. However, I am struggling to recreate a ggplot plot in echarts4r.
library(tidyverse)
library(echarts4r)
data = tibble(time = factor(sort(rep(c(4, 8, 24), 30)), levels = c(4, 8, 24)),
              dose = factor(rep(c(1, 2, 3), 30), levels = c(1, 2, 3)),
              id = rep(sort(rep(LETTERS[1:10], 3)), 3),
              y = rnorm(n = 90, mean = 5, sd = 3))
This is the plot I am aiming to recreate:
ggplot(data = data, mapping = aes(x = time, y = y, group = id)) +
  geom_point() +
  geom_line() +
  facet_wrap(~dose)
The problem I am having is reproducing in echarts4r the grouping that group = id gives in ggplot syntax. I am aiming to use e_facet on data grouped with group_by(), but I cannot (or don't know how to) add a second grouping that connects the dots the way geom_line() does.
data %>%
  group_by(dose) %>%
  e_charts(time) %>%
  e_line(y) %>%
  e_facet(rows = 1, cols = 3)
You can do this with echarts4r.
There are two methods that I know of that work, one uses e_list. I think that method would make this more complicated than it needs to be, though.
It might be useful to know that e_facet, e_arrange, and e_grid all fall under echarts grid functionality—you know, sort of like everything that ggplot2 does falls under base R's grid.
I used group_split from dplyr and imap from purrr to create the faceted graph. You'll notice that I didn't use e_facet due to its constraints.
group_split is interchangeable with base R's split and either could have been used.
I used imap so I could map over the groups and have the benefit of an index. If you're familiar with enumerate in a Python for statement or forEach in JavaScript, this works in a similar way. In the imap call, j is a data frame and k is an index value. I appended the additional arguments needed for e_arrange, then made the plot.
library(tidyverse) # loads both dplyr and purrr
library(echarts4r)
data %>%
  group_split(dose) %>%
  imap(function(j, k) {
    j %>%
      group_by(id) %>%
      e_charts(time, name = paste0("chart_", k)) %>%
      e_line(y, name = paste0("Dose ", k)) %>%
      e_color(color = "black")
  }) %>%
  append(c(rows = 1, cols = 3)) %>%
  do.call(e_arrange, .)
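To make the role of the index concrete, here is a minimal sketch of imap on its own, without echarts4r. The list `groups` is a hypothetical stand-in for the output of group_split(), which is also an unnamed list; for unnamed input, imap passes the position of each element as the second argument.

```r
library(purrr)

# Hypothetical stand-in for the output of group_split(): an unnamed list.
groups <- list(c(4.1, 5.3), c(6.0, 2.2), c(3.3, 7.8))

labels <- imap_chr(groups, function(j, k) {
  # j is the list element; k is its position (1, 2, 3) because the list is unnamed
  paste0("Dose ", k, ": n = ", length(j))
})
```

Note that in the chart code the label "Dose k" is built from the position in the list, which matches the dose levels only because they happen to be 1, 2 and 3 in that order.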

Visualise in R with ggplot, a k-means clustered developmental gene expression dataset

I can see many posts on this topic, but none addresses this question. Apologies if I missed a relevant answer. I have a large protein expression dataset, with samples like so as the columns:
rep1_0hr, rep1_16hr, rep1_24hr, rep1_48hr, rep1_72hr .....
and 2000+ proteins in the rows. In other words each sample is a different developmental timepoint.
If it is of any interest, the original dataset is 'mulvey2015' from the pRolocdata package in R, which I converted to a SummarizedExperiment object in RStudio.
I first ran k-means clustering on the data (an assay() of a SummarizedExperiment dataset) to get 12 clusters:
k_mul <- kmeans(scale(assay(mul)), centers = 12, nstart = 10)
Then:
summary(k_mul)
produced the expected output.
I would like the visualisation to look like this, with samples on the x-axis and expression on the y-axis. The plots look like they have been generated using facet_wrap() in ggplot:
For ggplot the data need to be provided as a dataframe with a column for the cluster identity of an individual protein. Also the data need to be in long format. I tried pivoting (pivot_longer) the original dataset, but of course there are a very large number of data points. Moreover, the image I posted shows that for any one plot, the number of coloured lines is smaller than the total number of proteins, suggesting that there might have been dimension reduction on the dataset first, but I am unsure. Up till now I have been running the kmeans algorithm without dimension reduction. Can I get guidance please for how to produce this plot?
Here is my attempt at reverse engineering the plot:
library(pRolocdata)
library(dplyr)
library(tidyverse)
library(magrittr)
library(ggplot2)
mulvey2015 %>%
  Biobase::assayData() %>%
  magrittr::extract2("exprs") %>%
  data.frame(check.names = FALSE) %>%
  tibble::rownames_to_column("prot_id") %>%
  mutate(.,
         cl = kmeans(select(., -prot_id),
                     centers = 12,
                     nstart = 10) %>%
           magrittr::extract2("cluster") %>%
           as.factor()) %>%
  pivot_longer(cols = !c(prot_id, cl),
               names_to = "Timepoint",
               values_to = "Expression") %>%
  ggplot(aes(x = Timepoint, y = Expression, color = cl)) +
  geom_line(aes(group = prot_id)) +
  facet_wrap(~ cl, ncol = 4)
As for your questions, pivot_longer is usually quite performant unless it fails to find unique key combinations or runs into data type conversion problems. The plot can be improved by:
tweaking the alpha parameter of geom_line (e.g. alpha = 0.5) to give an idea of the density of lines
finding a good abbreviation and order for Timepoint
changing axis.text.x orientation
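The three suggestions above can be sketched on simulated data, so that the example runs without pRolocdata. The matrix `expr`, the protein IDs, the choice of 4 clusters and the 90 degree label angle are all assumptions made for illustration; only the column names mimic the sample names from the question.

```r
library(dplyr)
library(tidyr)
library(tibble)
library(ggplot2)

set.seed(1)
# Simulated stand-in for the expression matrix: 200 proteins x 5 timepoints.
timepoints <- c("rep1_0hr", "rep1_16hr", "rep1_24hr", "rep1_48hr", "rep1_72hr")
expr <- matrix(rnorm(200 * 5), nrow = 200,
               dimnames = list(paste0("prot", 1:200), timepoints))

clusters <- kmeans(expr, centers = 4, nstart = 10)$cluster

long <- as.data.frame(expr) %>%
  rownames_to_column("prot_id") %>%
  mutate(cl = factor(clusters)) %>%
  pivot_longer(cols = -c(prot_id, cl),
               names_to = "Timepoint",
               values_to = "Expression") %>%
  # Fix the x-axis order explicitly instead of relying on alphabetical order.
  mutate(Timepoint = factor(Timepoint, levels = timepoints))

p <- ggplot(long, aes(x = Timepoint, y = Expression, colour = cl)) +
  geom_line(aes(group = prot_id), alpha = 0.5) +  # transparency shows line density
  facet_wrap(~ cl) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))  # rotated tick labels
```

Printing `p` draws the faceted plot. Converting Timepoint to a factor with explicit levels matters because names such as "rep1_16hr" and "rep1_0hr" do not sort chronologically as plain strings once there are more timepoints.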
Here is my own, very similar solution to the above.
dfsa_mul <- data.frame(scale(assay(mul)))
dfsa_mul2 <- rownames_to_column(dfsa_mul, "protID")
# Add the kmeans $cluster column to the dfsa_mul2 data frame.
# Only change clus to a factor after executing pivot_longer.
dfsa_mul2$clus <- ksa_mul$cluster
dfsa_mul2 %>%
  pivot_longer(cols = -c("protID", "clus"),
               names_to = "samples",
               values_to = "expression") %>%
  ggplot(aes(x = samples, y = expression, colour = factor(clus))) +
  geom_line(aes(group = protID)) +
  facet_wrap(~ factor(clus))
This generates a series of plots identical to the graphs posted by @sbarbit.

same y axis variable, scatter-plot and long format

Let's say two different raters are evaluating behavioral problems. They use the same scale (from 0 to 50) and the child being evaluated is the same for both raters. In social sciences, this method is common and there are some useful statistics, such as correlation coefficient and Intra-Class Correlation.
In addition, one graph that comes to mind is the scatter-plot: on the x-axis I'll plot the behavioral problem scores from the first rater, and on the y-axis the scores from the second rater.
ggplot2 creates amazing plots; however, some simple routines and actions become really difficult to do.
Please see the code below and the base R plot. I would like to know if ggplot can create this plot as well.
Thanks much
set.seed(123)
ds <- data.frame(behavior_problems = rnorm(100,30,2), evaluator=sample(1:2))
plot(ds$behavior_problems[ds$evaluator == '1'] ,
y = ds$behavior_problems[ds$evaluator == '2'])
== I had to edit to make clear why a scatter-plot would be informative==
I think the key problem here is the way in which you have set up the data frame. It is not clear that each individual gets a pair of scores, one from each evaluator. So the first thing to do is add an ID for each individual: 50 IDs in your example data.
library(tidyverse)
ds %>%
  mutate(id = rep(1:50, each = 2))
Now we can use tidyr::spread to create a column for each evaluator. But numbers for column names are not a great idea, so we'll rename them to e1 and e2.
ds %>%
  mutate(id = rep(1:50, each = 2)) %>%
  spread(evaluator, behavior_problems) %>%
  rename(e1 = `1`, e2 = `2`)
Now we have column names that can be supplied to ggplot:
ds %>%
  mutate(id = rep(1:50, each = 2)) %>%
  spread(evaluator, behavior_problems) %>%
  rename(e1 = `1`, e2 = `2`) %>%
  ggplot(aes(e1, e2)) +
  geom_point()
If this seems like a "long way around", it's because ggplot2 works better with "long" data (before the spread) than "wide" (after the spread). To illustrate, here's another way to visualize the difference in scores by individual, which I think works quite well:
ds %>%
  mutate(id = rep(1:50, each = 2),
         evaluator = factor(evaluator)) %>%
  ggplot(aes(id, behavior_problems)) +
  geom_point(aes(color = evaluator)) +
  geom_line(aes(group = id))
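A note on the spread() step: in tidyr 1.0.0 and later, spread() is superseded by pivot_wider(), which can also add the "e" prefix directly, so the rename() is no longer needed. The sketch below assumes the same example data and the same pairing of consecutive rows into individuals; the name `wide` is my own. It also computes the correlation between raters mentioned in the question.

```r
library(dplyr)
library(tidyr)

set.seed(123)
ds <- data.frame(behavior_problems = rnorm(100, 30, 2), evaluator = sample(1:2))

wide <- ds %>%
  mutate(id = rep(1:50, each = 2)) %>%          # assumption: consecutive rows are one child
  pivot_wider(names_from = evaluator,
              values_from = behavior_problems,
              names_prefix = "e")               # columns become e1 and e2

# Agreement between the two raters as a simple Pearson correlation.
rater_cor <- cor(wide$e1, wide$e2)
```

`wide` can then be passed to ggplot(aes(e1, e2)) + geom_point() exactly as in the spread() version.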

kmeans through time retain consistent cluster ID

There are times when we would like to know how the clustering of points might change through time. For example, say you have cities with demographic attributes by decade and you are interested in what cities are the "most similar" based on their attributes for each decade. Here is a toy dataset that illustrates the point:
library(dplyr)
library(ggplot2)
set.seed(1)
centers <- data.frame(cluster = factor(1:3),
                      size = c(100, 150, 50),
                      x1 = c(5, 0, -3),
                      x2 = c(-1, 1, -2))
year1 <- centers %>%
  group_by(cluster) %>%
  do(data.frame(x1 = rnorm(.$size[1], .$x1[1]),
                x2 = rnorm(.$size[1], .$x2[1]),
                year = "year 1",
                stringsAsFactors = FALSE)) %>%
  data.frame()
year2 <- centers %>%
  group_by(cluster) %>%
  do(data.frame(x1 = rnorm(.$size[1], .$x1[1]),
                x2 = rnorm(.$size[1], .$x2[1]),
                year = "year 2",
                stringsAsFactors = FALSE)) %>%
  data.frame()
points <- rbind(year1, year2)
We can calculate kmeans per year using something like below:
kclusters <- points %>%
  select(-cluster) %>%
  group_by(year) %>%
  do(data.frame(., kclust = kmeans(as.matrix(.[, -3]), centers = 3)$cluster)) %>%
  mutate(kclust = as.character(kclust))
And here is the resulting plot:
ggplot(kclusters) +
  geom_point(aes(x1, x2, color = kclust)) +
  facet_wrap(~year) +
  theme_bw() +
  scale_color_viridis_d()
The code works as expected but notice that the cluster IDs have changed. It doesn't make much difference here because I am plotting the clusters using the original x1 and x2, but my real example is making a map and the points are plotted in space using coordinates and colored according to clusters (i.e., the location of the point never changes). Imagine this same plot for several years--it becomes hard to track the changing cluster membership of individual points each year. Is there a way to keep the IDs consistent?
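One possible approach, offered as a sketch rather than a definitive answer, is to seed each year's kmeans with the fitted centers from the previous year. kmeans accepts a matrix of starting centers, and cluster i then grows out of starting center i, so labels stay aligned as long as the clusters drift only a little between years. The helper `make_year` and the base R rebuild of the toy data are my own simplifications of the setup above.

```r
set.seed(1)

# Hypothetical helper: one year of toy data with the same three true centers.
make_year <- function() {
  rbind(cbind(x1 = rnorm(100, 5),  x2 = rnorm(100, -1)),
        cbind(x1 = rnorm(150, 0),  x2 = rnorm(150, 1)),
        cbind(x1 = rnorm(50, -3),  x2 = rnorm(50, -2)))
}
year1 <- make_year()
year2 <- make_year()

# Year 1: ordinary k-means with several random starts.
km1 <- kmeans(year1, centers = 3, nstart = 10)

# Year 2: start from the year-1 centers so that cluster i keeps its identity.
km2 <- kmeans(year2, centers = km1$centers)

# Distance from each year-1 center (rows) to each year-2 center (columns).
center_dist <- as.matrix(dist(rbind(km1$centers, km2$centers)))[1:3, 4:6]
```

If clusters move a lot, merge or split, this seeding is not enough, and the centers would have to be matched explicitly after fitting, for example by pairing each new center with its nearest old one using `center_dist`.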

log scale and limits with ggvis

Hi I'm a little confused with the scales in ggvis.
I'm trying to do two things: one is have a log scale (the equivalent of log="x" in plot()). I'm also looking for the equivalent of xlim=c(). In both cases, the code below is not giving the expected results.
# install.packages("ggvis", dependencies = TRUE)
library(ggvis)
df <- data.frame(a=c(1, 2, 3, 1000, 10000), b=c(0.1069, 0.0278, 0.0860, 15.5640, 30.1745))
df %>% ggvis(~a, ~b)
df %>% ggvis(~a, ~b) %>% scale_numeric("x", trans="log")
Notice that with trans="log", all dots are on the left of the plot and the scale disappears.
Next, I want to restrict the plot to certain values. I could subset the data frame but I'm looking to have the equivalent of xlim from plot().
df %>% ggvis(~a, ~b) %>% scale_numeric("x", trans="linear", domain=c(10, 40))
This is giving even weirder results, so I'm guessing I might be misinterpreting what domain does.
Thanks for your help!
I've encountered the same problem that you've mentioned.
Apparently, the developer of ggvis noticed this bug as well. This is currently marked as an issue. You can find the issue here: https://github.com/rstudio/ggvis/issues/230
In the context of your question:
# install.packages("ggvis", dependencies = TRUE)
library(ggvis)
df <- data.frame(a=c(1, 2, 3, 1000, 10000), b=c(0.1069, 0.0278, 0.0860, 15.5640, 30.1745))
df %>% ggvis(~a, ~b)
# Points will disappear
df %>% ggvis(~a, ~b) %>% scale_numeric("x", trans="log")
# Should work
df %>% ggvis(~a, ~b) %>% scale_numeric("x", trans="log", expand=0)
However, you may notice that after the transformation, the spacing between ticks doesn't appear to be uniform. But at least the dots are rendered correctly.
I ran into the same issue and noticed that as soon as I removed the 0 data points from my data (of which log() cannot be computed) everything started to work fine.
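That observation can be checked without ggvis. The sketch below uses base R only: it drops non-positive values, for which the logarithm is not finite, and applies the transform by hand, so the transformed column could be plotted on a linear scale. The extra row with a = 0 is a hypothetical addition to the data from the question, and `df0` and `df_pos` are names chosen for illustration.

```r
# Data from the question plus one hypothetical row with a = 0.
df0 <- data.frame(a = c(0, 1, 2, 3, 1000, 10000),
                  b = c(0.5, 0.1069, 0.0278, 0.0860, 15.5640, 30.1745))

log10(df0$a[1])  # -Inf: a zero cannot be placed on a log axis

# Keep strictly positive values only, then transform manually.
df_pos <- df0[df0$a > 0, ]
df_pos$log10_a <- log10(df_pos$a)
```

Plotting `log10_a` against `b` on an ordinary linear scale gives the same point layout as a log axis, at the cost of tick labels that read as exponents.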
This is strange, I've now tried a lot of things with your data, but can't find the problem.
Then I tested it on the Violent Crime Rates by US State data (see help(USArrests)) and it worked like a charm.
data(USArrests)
# str(USArrests)
p <- USArrests %>% ggvis(~ Murder, ~ Rape) %>% layer_points()
p %>% scale_numeric("y", trans = "log")
This is not an answer, simply to share this with you.
