Problem with UpSet plot intersection numbers in R

I have four sets A, B, C and D like below:
A <- c("ENSG00000103472", "ENSG00000130600", "ENSG00000177335", "ENSG00000177337",
"ENSG00000178977", "ENSG00000180139", "ENSG00000180539", "ENSG00000187621",
"ENSG00000188511", "ENSG00000197099", "ENSG00000203446", "ENSG00000203739",
"ENSG00000203804", "ENSG00000204261", "ENSG00000204282", "ENSG00000204584",
"ENSG00000205056", "ENSG00000205837", "ENSG00000206337", "ENSG00000213057")
B <- c("ENSG00000146521", "ENSG00000165511", "ENSG00000174171", "ENSG00000176659",
"ENSG00000179428", "ENSG00000179840", "ENSG00000180539", "ENSG00000204261",
"ENSG00000204282", "ENSG00000204949", "ENSG00000206337", "ENSG00000223534",
"ENSG00000223552", "ENSG00000223725", "ENSG00000226252", "ENSG00000226751",
"ENSG00000226777", "ENSG00000227066", "ENSG00000227260", "ENSG00000227403")
C <- c("ENSG00000167912", "ENSG00000168405", "ENSG00000172965", "ENSG00000177234",
"ENSG00000177699", "ENSG00000177822", "ENSG00000179428", "ENSG00000179840",
"ENSG00000180139", "ENSG00000181800", "ENSG00000181908", "ENSG00000183674",
"ENSG00000189238", "ENSG00000196668", "ENSG00000196979", "ENSG00000197301",
"ENSG00000203446", "ENSG00000203999", "ENSG00000204261", "ENSG00000206337")
D <- c("ENSG00000122043", "ENSG00000162888", "ENSG00000167912", "ENSG00000176320",
"ENSG00000177699", "ENSG00000179253", "ENSG00000179428", "ENSG00000179840",
"ENSG00000180539", "ENSG00000181800", "ENSG00000185433", "ENSG00000188511",
"ENSG00000189238", "ENSG00000197301", "ENSG00000205056", "ENSG00000205562",
"ENSG00000213279", "ENSG00000214922", "ENSG00000215533", "ENSG00000218018")
An UpSet plot gave me the following result:
library(UpSetR)
mine <- list("A" = A,
"B" = B,
"C" = C,
"D" = D)
upset(fromList(mine), keep.order = TRUE)
But I'm interested in looking at intersections between specific sets: A & B, A & C, and A & D. So I did it like below:
upset(fromList(mine), intersections = list(list("A"),list("B"),list("C"),
list("D"),list("A", "B"),
list("A", "C"),
list("A", "D")), keep.order = TRUE)
But the intersection of A & B contains 4 genes, A & C contains 4, and A & D contains 3. Why does the above UpSet plot show different numbers?
How can I make it show the correct intersection sizes? I don't want the intersection of all sets.

The numbers are correct! The issue is very specific and complex.
There are different ways to calculate set intersection size:
"distinct" mode
"intersect" mode
"union" mode
UpSetR uses the "distinct" mode.
The "intersect" mode may be what you expect.
The ComplexHeatmap and ComplexUpset packages allow you to choose which mode to use.
I found a really thorough explanation by Jakob Rosenthal here: https://github.com/hms-dbmi/UpSetR/issues/72, especially the graphic in that issue, which illustrates the three modes.
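To see the difference concretely, here is a small sketch (assuming the ComplexHeatmap package is installed; the first lines use only base R and the sets defined above):
## Pairwise overlaps, i.e. "intersect" mode, with base R
length(intersect(A, B))  # 4
length(intersect(A, C))  # 4
length(intersect(A, D))  # 3
## "Distinct" mode, which is what UpSetR plots for the A & B bar:
## elements in A and B but in neither C nor D
length(setdiff(intersect(A, B), union(C, D)))  # 1
## ComplexHeatmap lets you pick the mode explicitly (plot with UpSet(m) if desired)
library(ComplexHeatmap)
m <- make_comb_mat(list(A = A, B = B, C = C, D = D), mode = "intersect")
comb_size(m)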


"Preserving" edge attributes with in built function

This post is related to a previous question.
The basic problem in that post was how to connect nodes by common retweeted users. The code suggested by @ThomasIsCoding does work.
My follow up question is how to make this function store edge attributes. I provide a toy example below:
My initial dataframe is of the form:
author.id<-c("A","B","D")
rt_user<-c("C","C","C")
id<-c(1,2,3)
example<-data.frame(author.id,rt_user,id)
Here nodes A, B, and D retweet from node C, and the id column is a numeric identifier for the tweets. Using this in a network structure and applying the aforementioned function gives:
g <- graph.data.frame(example, directed = T)
edge.attributes(g)
gres <- do.call(
  graph.union,
  lapply(
    names(V(g))[degree(g, mode = "out") == 0],
    function(x) {
      nbs <- names(V(g))[distances(g, v = x, mode = "in") == 1]
      disjoint_union(
        set_vertex_attr(make_full_graph(length(nbs)), name = "name", value = nbs),
        set_vertex_attr(make_full_graph(1), name = "name", value = x)
      )
    }
  )
)
plot(gres)
edge.attributes(gres)
My goal is to "preserve" the edge attributes of g in the final transformation gres. I want to know that A used tweet 1, B used tweet 2, and D used tweet 3. I believe they should now be transformed from edge attributes to vertex attributes, so that one still knows who the user is and which tweet was used, but I am not sure how to incorporate this into the code.
You can try the code below:
gres <- set_vertex_attr(gres,
  name = "id",
  value = E(g)$id[match(names(V(gres)), names(tail_of(g, E(g))))]
)
where the edge attribute of each edge in g is attached to the vertex at that edge's tail (found via tail_of), and you will see
> V(gres)$id
[1] 1 2 3 NA
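If the match() call looks opaque, here is a hedged sketch of the same idea using a named lookup vector (id_by_author is just an illustrative name, not an igraph function):
## tweet id keyed by the name of the tail vertex of each edge in g
id_by_author <- setNames(E(g)$id, names(tail_of(g, E(g))))
id_by_author
## A B D
## 1 2 3
## vertices of gres that are never a tail in g (here "C") get NA
gres <- set_vertex_attr(gres, name = "id", value = id_by_author[names(V(gres))])
V(gres)$id
## [1]  1  2  3 NA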

Using cpquery function for several pairs from dataset

I am a relative beginner in R and am trying to figure out how to use the cpquery function from the bnlearn package for all edges of a DAG.
First of all, I created a bn object, a fitted network, and a table with all arc strengths.
library(bnlearn)
data(learning.test)
baynet = hc(learning.test)
fit = bn.fit(baynet, learning.test)
sttbl = arc.strength(x = baynet, data = learning.test)
Then I tried to create a new variable in the sttbl dataset holding the result of a cpquery call.
sttbl = sttbl %>% mutate(prob = NA) %>% arrange(strength)
sttbl[1,4] = cpquery(fit, `A` == 1, `D` == 1)
It looks pretty good (especially on bigger data), but when I try to automate this process somehow, I struggle with errors such as:
Error in sampling(fitted = fitted, event = event, evidence = evidence, :
logical vector for evidence is of length 1 instead of 10000.
In a perfect situation, I need to create a function that fills the generated prob variable of the sttbl dataset regardless of its size. I tried to do it with a for loop too, but stumbled over the error above again and again. Unfortunately, I deleted the failed attempts, but they were something like this:
for (i in 1:nrow(sttbl)) {
  j = sttbl[i,1]
  k = sttbl[i,2]
  sttbl[i,4] = cpquery(fit, fit$j %in% sttbl[i,1]==1, fit$k %in% sttbl[i,2]==1)
}
or this:
for (i in 1:nrow(sttbl)) {
  sttbl[i,4] = cpquery(fit, sttbl[i,1] == 1, sttbl[i,2] == 1)
}
Now I think I have misunderstood something in R or in the bnlearn package.
Could you please tell me how to accomplish this task, filling the column with multiple cpquery calls? That would help me a lot with my research!
cpquery is quite difficult to work with programmatically. If you look at the examples in the help page you can see the author uses eval(parse(...)) to build the queries. I have added two approaches below, one using the methods from the help page and one using cpdist to draw samples and reweighting to get the probabilities.
Your example
library(bnlearn); library(dplyr)
data(learning.test)
baynet = hc(learning.test)
fit = bn.fit(baynet, learning.test)
sttbl = arc.strength(x = baynet, data = learning.test)
sttbl = sttbl %>% mutate(prob = NA) %>% arrange(strength)
This uses cpquery and the much maligned eval(parse(...)) -- this is the approach that the bnlearn author takes to do this programmatically in the ?cpquery examples. Anyway,
# You want the evidence and event to be the same; in your question it is `1`
# but for example using learning.test data we use 'a'
state = "\'a\'" # note if the states are character then these need to be quoted
event = paste(sttbl$from, "==", state)
evidence = paste(sttbl$to, "==", state)
# loop through using code similar to that found in `cpquery`
set.seed(1) # to make sampling reproducible
for(i in 1:nrow(sttbl)) {
  qtxt = paste("cpquery(fit, ", event[i], ", ", evidence[i], ", n=1e6", ")")
  sttbl$prob[i] = eval(parse(text=qtxt))
}
I find it preferable to work with cpdist, which is used to generate random samples conditional on some evidence. You can then use these samples to build up queries. If you use likelihood weighting (method="lw") it is slightly easier to do this programmatically (and without evil(parse(...))).
The evidence is added in a named list i.e. list(A='a').
# The following just gives a quick way to assign the same
# evidence state to all the evidence nodes.
evidence = setNames(replicate(nrow(sttbl), "a", simplify = FALSE), sttbl$to)
# Now loop though the queries
# As we are using likelihood weighting we need to reweight to get the probabilities
# (cpquery does this under the hood)
# Also note with this method that you could simulate from more than
# one variable (event) at a time if the evidence was the same.
for(i in 1:nrow(sttbl)) {
  temp = cpdist(fit, sttbl$from[i], evidence[i], method="lw")
  w = attr(temp, "weights")
  sttbl$prob2[i] = sum(w[temp == 'a']) / sum(w)
}
sttbl
# from to strength prob prob2
# 1 A D -1938.9499 0.6186238 0.6233387
# 2 A B -1153.8796 0.6050552 0.6133448
# 3 C D -823.7605 0.7027782 0.7067417
# 4 B E -720.8266 0.7332107 0.7328657
# 5 F E -549.2300 0.5850828 0.5895373
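If you want to avoid rewriting the loop each time, the cpdist approach can also be wrapped in a small helper; pair_prob below is an illustrative name (not part of bnlearn) and assumes the evidence state is the same for every pair:
## estimate P(from == state | to == state) by likelihood weighting
pair_prob <- function(fit, from, to, state = "a", n = 1e6) {
  from <- as.character(from); to <- as.character(to)
  ev  <- setNames(list(state), to)                      # evidence as a named list
  smp <- cpdist(fit, nodes = from, evidence = ev, method = "lw", n = n)
  w   <- attr(smp, "weights")
  sum(w[smp[[from]] == state]) / sum(w)                 # reweight to get the probability
}
set.seed(1)
sttbl$prob3 <- mapply(pair_prob, from = sttbl$from, to = sttbl$to,
                      MoreArgs = list(fit = fit))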

How to group factor levels?

I have a factor column with football position abbreviations, around 17 unique values with 220 observations. I want to have only three factor levels which encompass the 17 unique values.
levels(nfldraft$Pos) <- list(Linemen = c("C","OG","OT","TE","DT","DE"),
Small_Backs = c("CB","WR","FS"),
Big_Backs = c("FB","ILB","OLB","P","QB",
"RB","SS","WR"))
This is what I tried. Printing nfldraft$Pos to the console shows 3 factor levels, but all the values are either "Linemen" or "Small_Backs" and all the others are NA. Where am I going wrong?
I made up an example character vector with all of the abbreviations:
my_example <- c("C","OG","OT","TE","DT","DE","CB","WR","FS",
"FB","ILB","OLB","P","QB","RB","SS","WR")
class(my_example)
[1] "character"
Then I substituted the desired levels for their abbreviations (you could also use gsub here or any of many, many different approaches):
my_example[my_example %in% c("C","OG","OT","TE","DT","DE")] <- "Linemen"
my_example[my_example %in% c("CB","WR","FS")] <- "Small Backs"
my_example[my_example %in% c("FB","ILB","OLB","P",
"QB","RB","SS","WR")] <- "Big Backs"
Then I made it into a factor:
my_example <- as.factor(my_example)
head(my_example)
[1] Linemen Linemen Linemen Linemen Linemen Linemen
Levels: Big Backs Linemen Small Backs
tail(my_example)
[1] Big Backs Big Backs Big Backs Big Backs Big Backs Small Backs
Levels: Big Backs Linemen Small Backs
class(my_example)
[1] "factor"
This is a good example of needing a fully reproducible example. Actually, OP's code looks like it should work. Taking @Hack-R's sample input:
my_example <- c("C","OG","OT","TE","DT","DE","CB","WR","FS",
"FB","ILB","OLB","P","QB","RB","SS","WR")
OP's original code works as-is:
nfldraft = list(Pos = factor(my_example))
levels(nfldraft$Pos) <- list(
Linemen = c("C","OG","OT","TE","DT","DE"),
Small_Backs = c("CB","WR","FS"),
Big_Backs = c("FB","ILB","OLB","P","QB","RB","SS","WR")
)
table(nfldraft$Pos)
# Linemen Small_Backs Big_Backs
# 6 2 9
This is exactly in line with the documentation for how to use levels<-:
levels(x) <- value
value A valid value for levels(x)... For the factor method, a vector of character strings with length at least the number of levels of x, or a named list specifying how to rename the levels.
So it seems there's something else wrong with OP's input.
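One plausible cause (a guess, since we don't have OP's actual data) is an abbreviation in Pos that is not listed anywhere in the replacement list; levels<- drops any level it doesn't cover, so those observations become NA:
## "NT" is a hypothetical abbreviation not covered by the mapping
pos_extra <- factor(c(my_example, "NT"))
levels(pos_extra) <- list(
  Linemen     = c("C","OG","OT","TE","DT","DE"),
  Small_Backs = c("CB","WR","FS"),
  Big_Backs   = c("FB","ILB","OLB","P","QB","RB","SS","WR")
)
table(pos_extra, useNA = "ifany")
#     Linemen Small_Backs   Big_Backs        <NA>
#           6           2           9           1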
You can also use the mapvalues() function from the plyr package (combined here with dplyr's mutate()).
In your example it would be:
Linemen_levels = c("C","OG","OT","TE","DT","DE")
Small_Backs_levels = c("CB","WR","FS")
Big_Backs_levels = c("FB","ILB","OLB","P","QB","RB","SS","WR")
nfldraft <- nfldraft %>% mutate(Pos = mapvalues(Pos,
  from = c(Linemen_levels, Small_Backs_levels, Big_Backs_levels),
  to = c(rep('Linemen', length(Linemen_levels)),
         rep('Small_Backs', length(Small_Backs_levels)),
         rep('Big_Backs', length(Big_Backs_levels)))))

Tinkerpop 3: compute connected components with Gremlin traversal

I think the tags explain my problem quite well :)
I've been trying to write a Gremlin traversal to compute the connected components of the simple graph described at the end of the post.
I tried with
g.V().repeat(both('e')).until(cyclicPath()).dedup().tree().by('name').next()
obtaining
==>a={b={a={}, c={b={}}, d={c={d={}}}}, c={d={c={}}}}
==>e={f={e={}, g={f={}}}, h={f={h={}}}}
==>g={f={g={}}}
which is bad, since the cyclicPath filter terminated the traversal starting from e before reaching g.
Obviously, if I remove the until clause I get an infinite loop.
Moreover, if I use simplePath the traverse ends after one step.
Is there any way to tell it to explore the nodes in depth-first order?
Cheers!
a = graph.addVertex(T.id, 1, "name", "a")
b = graph.addVertex(T.id, 2, "name", "b")
c = graph.addVertex(T.id, 3, "name", "c")
d = graph.addVertex(T.id, 4, "name", "d")
e = graph.addVertex(T.id, 5, "name", "e")
f = graph.addVertex(T.id, 6, "name", "f")
g = graph.addVertex(T.id, 7, "name", "g")
h = graph.addVertex(T.id, 8, "name", "h")
a.addEdge("e", b)
a.addEdge("e", c)
b.addEdge("e", c)
b.addEdge("e", d)
c.addEdge("e", d)
e.addEdge("e", f)
e.addEdge("e", h)
f.addEdge("e", h)
f.addEdge("e", g)
This query was also discussed in the Gremlin-users group.
Here is the solution I came up with.
@Daniel Kuppitz also had an interesting solution you can find in the mentioned thread.
I think that if it is always true that, in an undirected graph, the "last" node visited in a connected component either leads back to a previously visited node (cyclicPath()) or has degree <= 1, then this query should work:
g.V().repeat(both('e')).until( cyclicPath().or().both('e').count().is(lte(1)) ).dedup().tree().by('name').next()
On my example it gives the following output
gremlin> g.V().repeat(both('e')).until(cyclicPath().or().both('e').count().is(lte(1))).dedup().tree().by('name').next()
==>a={b={a={}, c={b={}}, d={c={d={}}}}, c={d={c={}}}}
==>e={f={e={}, g={}, h={f={}}}, h={f={h={}}}}
Just to enhance @Alberto's version, which is working well, you can use the simplePath() traversal step (http://tinkerpop.apache.org/docs/current/reference/#simplepath-step) to ensure that the traverser does not repeat its path through the graph:
g.V().repeat(both().simplePath())
.until(bothE().count().is(lte(1)))
.dedup().tree().by('name').next()

Annotated correlation tables with stargazer

I want to report correlation tables in a LaTeX report and I'm using 'stargazer' to transform my R objects into TeX code. The correlational data is currently stored in a data frame.
I would like to print row names and possibly add an annotation under the table. I couldn't find a 'print rownames' argument, and the 'notes' argument doesn't seem to work.
Any ideas?
## create object
x <- matrix(1:4, 2, byrow = TRUE)
dimnames(x) <- list(c("A", "B"), c("A", "B"))
x <- data.frame(x)
## create Tex-Code
stargazer(x, summary = FALSE, title = "2x2 Matrix",
notes = "This is a two by two Matrix")
As of version 5.0, stargazer can directly output the content of matrices/vectors. The following code should provide an easy and intuitive resolution to your problem:
## create object
x <- matrix(1:4, 2, byrow = TRUE)
dimnames(x) <- list(c("A", "B"), c("A", "B"))
## create Tex-Code
stargazer(x, title = "2x2 Matrix",
notes = "This is a two by two Matrix")
This is more of a Markdown solution, which can be converted to LaTeX with e.g. Pandoc:
> require(pander)
> pander(x, caption = 'Annotation')
---------------
A B
------- --- ---
**A** 1 2
**B** 3 4
---------------
Table: Annotation
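If a LaTeX table with row names and a caption is acceptable, knitr::kable() is another option (a hedged alternative to the pander route above, assuming knitr is installed; note that kable's caption is a standard LaTeX caption, not a stargazer-style note under the table):
library(knitr)
x <- matrix(1:4, 2, byrow = TRUE)
dimnames(x) <- list(c("A", "B"), c("A", "B"))
kable(x, format = "latex", caption = "2x2 Matrix")  # row names are kept by default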
To get the 'rownames', try this hackish solution:
## create object
x <- matrix(1:4, 2, byrow = TRUE)
x <- data.frame(x)
x <- cbind(c("A","B"),x)
colnames(x) <- c("","A", "B")
## create Tex-Code
stargazer(x, summary = FALSE, title = "2x2 Matrix",
notes = "This is a two by two Matrix", type="text")
At the moment (v. 4.5.1), 'stargazer' is best suited to working with regression tables and data frames. Your question, however, suggests that users might be interested in better support for matrices. Expect this in future releases (next few months).
As for notes, these really only work for regression tables at the moment. However, they will be available for summary statistics and data frame tables in the next release. If you're willing to edit the source, you can get something very close (although not quite perfect) to the future implementation by replacing the following line(s):
.format.s.stat.parts <<- c("-!","stat names","-!","statistics1","-!")
by:
.format.s.stat.parts <<- c("-!","stat names","-!","statistics1","-!","notes")
