Most elegant way to convert lists into igraph object for plotting - r

I am new to igraph and it seems to be a very powerful (and therefore also complex) package.
I tried to convert the following lists into an igraph object.
graph <- list(s = c("a", "b"),
a = c("s", "b", "c", "d"),
b = c("s", "a", "c", "d"),
c = c("a", "b", "d", "e", "f"),
d = c("a", "b", "c", "e", "f"),
e = c("c", "d", "f", "z"),
f = c("c", "d", "e", "z"),
z = c("e", "f"))
weights <- list(s = c(3, 5),
a = c(3, 1, 10, 11),
b = c(5, 3, 2, 3),
c = c(10, 2, 3, 7, 12),
d = c(15, 7, 2, 11, 2),
e = c(7, 11, 3, 2),
f = c(12, 2, 3, 2),
z = c(2, 2))
Interpretation is as follows: s is the starting node, it links to nodes a and b. The edges are weighted 3 for s to a and 5 for s to b and so on.
I tried all kinds of functions from igraph but only got all kinds of errors. What is the most elegant and easy way to convert the above into an igraph object for plotting the graph?

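Create an edge list, build a graph from it, then assign the weights and plot it. Note that stack() returns its columns as (values, ind), i.e. (target, source), so they are reordered below to keep each edge pointing from the list name to its neighbour.
library(igraph)
set.seed(123)
# reorder the stacked columns to (source, target)
e <- as.matrix(stack(graph)[, c("ind", "values")])
g <- graph_from_edgelist(e)
# the weights are stacked in the same order as the edges
E(g)$weight <- stack(weights)$values
plot(g, edge.label = E(g)$weight)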
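The same conversion can also go through a data frame of edges, which keeps each weight on the same row as its edge. The following is only a sketch under a few assumptions: the lists are shortened to three nodes for brevity, the column names from, to and weight are chosen for illustration, and the igraph step only runs if the package is installed.

```r
# Adjacency lists in the same shape as the question (shortened to three nodes)
graph   <- list(s = c("a", "b"), a = c("s", "b"), b = c("s", "a"))
weights <- list(s = c(3, 5),     a = c(3, 1),     b = c(5, 3))

# Sanity check: every node must have as many weights as neighbours
stopifnot(identical(lengths(graph), lengths(weights)))

# One row per directed edge: source node, target node, weight
edges <- data.frame(
  from   = rep(names(graph), lengths(graph)),
  to     = unlist(graph, use.names = FALSE),
  weight = unlist(weights, use.names = FALSE)
)

# Build and plot the graph only if igraph is available;
# extra columns such as weight become edge attributes
if (requireNamespace("igraph", quietly = TRUE)) {
  g <- igraph::graph_from_data_frame(edges, directed = TRUE)
  plot(g, edge.label = igraph::E(g)$weight)
}
```

Keeping source, target and weight in one table makes it easy to inspect the edges before plotting.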

Related

Adding extra track to outside of circos plot (circlize, chordDiagram)

I'm trying to recreate this figure below, where the "to" variable (i.e. target genes) is further grouped into outer (labelled) categories (i.e. receptors).
I have generated some example data, unfortunately I'm not sure what format is needed for the additional outer categories, but it's possibly not far off the link format.
library(circlize)
links <- data.frame(from = c("A", "B", "C", "B", "C"),
to = c("D", "E", "F", "D", "E"),
value = c(1, 1, 1, 1, 1))
categories <- data.frame(from = c("D", "E", "F", "D", "E"),
to = c("X", "X", "Y", "Y", "Y"),
value = c(1, 1, 1, 1, 1))
chordDiagram(links)
Any assistance greatly appreciated!

Find the overlap of two datasets

I have two different datasets as I've shown below: df_A and df_B.
df_A <- tribble(
~book_name, ~sales_id,
"A", 1,
"B", 2,
"C", 3,
"D", 4,
"E", 5,
"F", 3,
"G", 8,
"H", 6,
"I", 7,
"J", 7,
)
df_B <- tribble(
~book_name, ~sales_id,
"A", 1,
"N", 2,
"C", 3,
"E", 4,
"K", 5,
"R", 3,
"S", 8,
"U", 6,
"Z", 7,
"Y", 7,
)
Now I want to see the overlap of these two datasets on book_name. Namely, I want a list of the book_name values that appear in both datasets, and a measure of how similar the two datasets are according to the book_name column.
Is there an accurate way to do this?
You can do an inner join between the two data frames, which keeps only the rows whose book_name appears in both of them.
This should do the trick:
library(dplyr)
# Creating first data frame
df_A <- tribble(
~book_name, ~sales_id,
"A", 1,
"B", 2,
"C", 3,
"D", 4,
"E", 5,
"F", 3,
"G", 8,
"H", 6,
"I", 7,
"J", 7,
)
# Creating second data frame
df_B <- tribble(
~book_name, ~sales_id,
"A", 1,
"N", 2,
"C", 3,
"E", 4,
"K", 5,
"R", 3,
"S", 8,
"U", 6,
"Z", 7,
"Y", 7,
)
# Joining between the two dataframes to get the common values between the two
result <-
df_A %>%
inner_join(df_B, by = "book_name")
Here is a base R solution using intersect():
overlap <- subset(df_A, book_name %in% intersect(book_name, df_B$book_name))
which gives
> overlap
# A tibble: 3 x 2
  book_name sales_id
  <chr>        <dbl>
1 A                1
2 C                3
3 E                5
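The question also asks how similar the two datasets are on book_name. One common choice, used here as an assumption rather than something the answers above propose, is the Jaccard index: the number of shared names divided by the number of distinct names overall. A minimal base R sketch, with the book names copied from df_A and df_B:

```r
# book_name columns of df_A and df_B from the question
books_A <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
books_B <- c("A", "N", "C", "E", "K", "R", "S", "U", "Z", "Y")

# Names present in both datasets
common <- intersect(books_A, books_B)

# Jaccard index: shared names / all distinct names
jaccard <- length(common) / length(union(books_A, books_B))

# Share of df_A's books that also occur in df_B
share_A <- mean(books_A %in% books_B)

print(common)
print(jaccard)
print(share_A)
```

With this data, three of the 17 distinct names are shared, so the Jaccard index is 3/17.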

Removing "unused" nodes in sankey network

I am trying to build a sankey network.
This is my data and code:
library(networkD3)
nodes <- data.frame(c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "D", "E", "N", "O", "P", "Q", "R"))
names(nodes) <- "name"
nodes$name = as.character(nodes$name)
links <- data.frame(matrix(
c(0, 2, 318.167,
0, 3, 73.85,
0, 4, 51.1262,
0, 5, 6.83333,
0, 6, 5.68571,
0, 7, 27.4167,
0, 8, 4.16667,
0, 9, 27.7381,
1, 10, 627.015,
1, 3, 884.428,
1, 4, 364.211,
1, 13, 12.33333,
1, 14, 9,
1, 15, 37.2833,
1, 16, 9.6,
1, 17, 30.5485), nrow=16, ncol=3, byrow = TRUE))
colnames(links) <- c("source", "target", "value")
links$source = as.integer(links$source)
links$target = as.integer(links$target)
links$value = as.numeric(links$value)
sankeyNetwork(Links = links, Nodes = nodes, Source = "source",
Target = "target", Value = "value", NodeID = "name",
fontSize = 12, fontFamily = 'Arial', nodeWidth = 20)
The problem is that D and E are the only targets that A and B have in common.
Although the links are displayed correctly, D and E also appear a second time at the bottom right.
How can I avoid this?
Note: If I specify
nodes <- data.frame(c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "N", "O", "P", "Q", "R"))
no network at all is created.
Nodes must be unique; see the example below. I removed the repeated nodes "D" and "E", which leaves 16 nodes with zero-based indices 0:15. Removing two entries from the middle of the node list shifts every later node down by two, so the targets 13 to 17 in your links data frame become 11 to 15.
Or, as @CJYetman (networkD3 author) comments:
Another way to say it... every node that is in the nodes data frame will be plotted, even if it has the same name as another node, because the index is technically the unique id.
library(networkD3)
nodes <- data.frame(name = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "N", "O", "P", "Q", "R"),
ix = 0:15)
links <- data.frame(matrix(
c(0, 2, 318.167,
0, 3, 73.85,
0, 4, 51.1262,
0, 5, 6.83333,
0, 6, 5.68571,
0, 7, 27.4167,
0, 8, 4.16667,
0, 9, 27.7381,
1, 10, 627.015,
1, 3, 884.428,
1, 4, 364.211,
1, 11, 12.33333,
1, 12, 9,
1, 13, 37.2833,
1, 14, 9.6,
1, 15, 30.5485), nrow=16, ncol=3, byrow = TRUE))
colnames(links) <- c("source", "target", "value")
sankeyNetwork(Links = links, Nodes = nodes, Source = "source",
Target = "target", Value = "value", NodeID = "name",
fontSize = 12, fontFamily = 'Arial', nodeWidth = 20)
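Link indices can also be recomputed from the node names instead of being edited manually. The sketch below assumes the original node vector with its duplicates and a shortened, illustrative subset of the links; old_names and new_names are hypothetical helper names. It looks up each link's node name under the old numbering and then matches it against the de-duplicated names.

```r
# Original node names, including the duplicated "D" and "E"
old_names <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K",
               "D", "E", "N", "O", "P", "Q", "R")

# A few links using the original zero-based indices (illustrative subset)
links <- data.frame(source = c(0, 0, 1, 1, 1),
                    target = c(3, 4, 3, 13, 17),
                    value  = c(73.85, 51.1262, 884.428, 12.33333, 30.5485))

# De-duplicated nodes; a node's position minus one is its new index
new_names <- unique(old_names)
nodes <- data.frame(name = new_names)

# Translate old index -> name -> new zero-based index
links$source <- match(old_names[links$source + 1], new_names) - 1
links$target <- match(old_names[links$target + 1], new_names) - 1

# Draw the diagram only if networkD3 is available
if (requireNamespace("networkD3", quietly = TRUE)) {
  networkD3::sankeyNetwork(Links = links, Nodes = nodes, Source = "source",
                           Target = "target", Value = "value", NodeID = "name")
}
```

Because the indices are derived from the names, no link is lost and no node is left without a connection after the duplicates are removed.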

Kruskal-Wallis test: create lapply function to subset data.frame?

I have a data set of values (val) grouped by multiple categories (distance and phase). I would like to run a Kruskal-Wallis test within each category, where val is the dependent variable, distance is a factor, and phase splits my data into 3 groups.
As such, I need to specify the subset of the data within the Kruskal-Wallis test and then apply the test to each of the groups. But I cannot get my subsetting to work!
The R help says that subset is an optional vector specifying a subset of observations to be used. But how do I pass this correctly to my lapply function?
My dummy data:
# create data
val<-runif(60, min = 0, max = 100)
distance<-floor(runif(60, min=1, max=3))
phase<-rep(c("a", "b", "c"), 20)
df<-data.frame(val, distance, phase)
# get unique groups
ii<-unique(df$phase)
# get basic statistics per group
aggregate(val ~ distance + phase, df, mean)
# run Kruskal test, specify the subset
kruskal.test(df$val ~df$distance,
subset = phase == "c")
This works well, so my subset should be correctly set as a vector.
But how to use this in a lapply function?
# DOES not work!!
lapply(ii, kruskal.test(df$val ~ df$distance,
subset = df$phase == as.character(ii)))
My overall goal is to create a function from kruskal.test, and save all statistics for each group into one table.
All help is highly appreciated.
Usually you would start by splitting, and then lapplying.
Something like
lapply(split(df, df$phase), function(d) { kruskal.test(val ~ distance, data=d) })
would yield a list, indexed by the phase, of the results of kruskal.test.
Your final expression does not work because lapply expects a function as its second argument, whereas kruskal.test(...) is evaluated immediately and returns a test result, not a function. It also compares df$phase with the whole vector ii instead of a single group. Wrapping the call in a function of the group works, although it is a little less idiomatic:
lapply(ii, function(i) { kruskal.test(df$val ~ df$distance, subset=df$phase==i )})
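To collect the statistics for every group in one table, as the question asks, each test result can be reduced to a one-row data frame and the rows bound together. This is a sketch in base R under stated assumptions: the dummy data is regenerated with a fixed seed and a deterministic distance column so that every phase contains both distance levels, and the output column names are illustrative.

```r
# Dummy data in the same shape as the question
set.seed(1)
df <- data.frame(
  val      = runif(60, min = 0, max = 100),
  distance = rep(c(1, 2), 30),
  phase    = rep(c("a", "b", "c"), 20)
)

# Run the test per phase and keep the key statistics as one row each
results <- do.call(rbind, lapply(split(df, df$phase), function(d) {
  kt <- kruskal.test(val ~ distance, data = d)
  data.frame(
    phase     = d$phase[1],
    n         = nrow(d),
    statistic = unname(kt$statistic),   # Kruskal-Wallis chi-squared
    df        = unname(kt$parameter),   # degrees of freedom
    p.value   = kt$p.value
  )
}))

print(results)
```

The result has one row per phase, so it can be written to a file or merged with other summaries directly.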
Though it is late, this might help someone with the same problem. Here is an answer using the tidyverse and rstatix packages. The rstatix package "provides a simple and intuitive pipe friendly framework, coherent with the 'tidyverse' design philosophy for performing basic statistical tests".
library(rstatix)
library(tidyverse)
df %>%
group_by(phase) %>%
kruskal_test(val ~ distance)
Output
# A tibble: 3 x 7
  phase .y.       n statistic    df     p method
* <chr> <chr> <int>     <dbl> <int> <dbl> <chr>
1 a     val      20    0.230      1 0.631 Kruskal-Wallis
2 b     val      20    0.0229     1 0.88  Kruskal-Wallis
3 c     val      20    0.322      1 0.570 Kruskal-Wallis
which is the same as the result given by @user295691.
Data
df = structure(list(val = c(93.8056977232918, 31.0681172646582, 40.5262873973697,
47.6368983509019, 65.23181500379, 64.4571609096602, 10.3301600087434,
90.4661140637472, 41.2359046051279, 28.3357713604346, 49.8977075796574,
10.8744730940089, 5.31001624185592, 71.9248640118167, 99.0267782937735,
73.7928744405508, 3.31214582547545, 40.2693636715412, 27.6980920461938,
79.501334275119, 60.5167196830735, 89.9171086261049, 87.4633299885318,
43.1893823202699, 91.1248738644645, 99.755659350194, 7.25280269980431,
96.957387868315, 75.0860505970195, 52.3794749286026, 26.6221587313339,
52.5518182432279, 24.1361060412601, 49.5364486705512, 65.5214034719393,
38.9469220302999, 0.687191751785576, 19.3090825574473, 19.6511475136504,
25.5966754630208, 7.33999472577125, 33.9820940745994, 50.3751677693799,
10.811762069352, 17.2359711956233, 53.958406439051, 64.2723652534187,
92.7404976682737, 26.824192632921, 30.0975760444999, 52.0105463219807,
74.4495407678187, 56.0636054025963, 91.891074879095, 14.0827904455364,
59.3607738381252, 66.5170294465497, 24.1726311156526, 83.0881901318207,
35.5380675755441), distance = c(2, 1, 1, 1, 1, 2, 1, 2, 2, 1,
2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1,
1, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 2, 2, 1,
1, 2, 1, 1, 2, 2, 2, 2), phase = c("a", "b", "c", "a", "b", "c",
"a", "b", "c", "a", "b", "c", "a", "b", "c", "a", "b", "c", "a",
"b", "c", "a", "b", "c", "a", "b", "c", "a", "b", "c", "a", "b",
"c", "a", "b", "c", "a", "b", "c", "a", "b", "c", "a", "b", "c",
"a", "b", "c", "a", "b", "c", "a", "b", "c", "a", "b", "c", "a",
"b", "c")), class = "data.frame", row.names = c(NA, -60L))

Using matplot in R whenever certain column changes

Sorry in advance because I am new at asking questions here and don't know how to input this table properly.
Say I have a data frame in R constructed like:
team = c("A", "A", "A", "B", "B", "B", "C", "C", "C")
value = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
m = cbind(team, value)
I want to create a plot that will give me 3 lines graphing the values for teams A, B, and C. I believe I can do this inputting the matrix m into matplot somehow, but I'm not sure how.
EDIT: I've gotten a lot closer to solving my problem. However I've realized that for some reason, with the code I have, "Value" is a list of 745 which matches the number of rows in my dataframe m. However when I unlist(Value) it turns into a numeric of length 894. Any ideas on why this would happen?
You can try something like this:
team = c("A", "A", "A", "B", "B", "B", "C", "C", "C")
value = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
m = cbind.data.frame(team, value)
library(ggplot2)
ggplot(m, aes(x=as.factor(1:nrow(m)), y=value, group=team, col=team)) +
geom_line(lwd=2) + xlab('index')
If you have the same number of ordered values for each team, you could use matplot to visualize them, but the data must be converted to a matrix with one column per team first:
m = cbind.data.frame(team, value, index = rep(1:3, 3))
m <- reshape(m, v.names = 'value', idvar = 'team', direction = 'wide', timevar = 'index')
matplot(t(m[, 2:4]), type = 'l', lty = 1)
legend('top', legend = m[, 1], lty = 1, col = 1:3)
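The wide matrix that matplot needs can also be built without reshape(). The sketch below assumes every team has the same number of values, already in plotting order; wide is a hypothetical name for the resulting matrix.

```r
team  <- c("A", "A", "A", "B", "B", "B", "C", "C", "C")
value <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# Split the values by team and bind them into a matrix, one column per team
# (this only simplifies to a matrix when all teams have equally many values)
wide <- sapply(split(value, team), identity)

# matplot draws one line per column
matplot(wide, type = "l", lty = 1, col = 1:3, xlab = "index", ylab = "value")
legend("topleft", legend = colnames(wide), lty = 1, col = 1:3)
```

If the teams had different numbers of values, sapply would return a list instead of a matrix, so the values would need padding with NA first.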
