Tinkerpop 3: compute connected components with Gremlin traversal

I think the tags explain my problem quite well :)
I've been trying to write a Gremlin traversal to compute the connected components of the simple graph described at the end of the post.
I tried with
g.V().repeat(both('e')).until(cyclicPath()).dedup().tree().by('name').next()
obtaining
==>a={b={a={}, c={b={}}, d={c={d={}}}}, c={d={c={}}}}
==>e={f={e={}, g={f={}}}, h={f={h={}}}}
==>g={f={g={}}}
which is bad, since the cyclicPath filter terminated the traversal starting from e before reaching g.
Obviously, if I remove the until clause I get an infinite loop.
Moreover, if I use simplePath the traversal ends after one step.
Is there any way to tell it to explore the nodes in depth-first order?
Cheers!
a = graph.addVertex(T.id, 1, "name", "a")
b = graph.addVertex(T.id, 2, "name", "b")
c = graph.addVertex(T.id, 3, "name", "c")
d = graph.addVertex(T.id, 4, "name", "d")
e = graph.addVertex(T.id, 5, "name", "e")
f = graph.addVertex(T.id, 6, "name", "f")
g = graph.addVertex(T.id, 7, "name", "g")
h = graph.addVertex(T.id, 8, "name", "h")
a.addEdge("e", b)
a.addEdge("e", c)
b.addEdge("e", c)
b.addEdge("e", d)
c.addEdge("e", d)
e.addEdge("e", f)
e.addEdge("e", h)
f.addEdge("e", h)
f.addEdge("e", g)

This query was also discussed in the Gremlin-users group.
Here is the solution I came up with.
@Daniel Kuppitz also had an interesting solution, which you can find in the mentioned thread.
I think that if it is always true that, in an undirected graph, the "last" node of the traversal of a connected component either leads to a previously visited node (cyclicPath()) or has degree <= 1, this query should work:
g.V().repeat(both('e')).until( cyclicPath().or().both('e').count().is(lte(1)) ).dedup().tree().by('name').next()
On my example it gives the following output
gremlin> g.V().repeat(both('e')).until(cyclicPath().or().both('e').count().is(lte(1))).dedup().tree().by('name').next()
==>a={b={a={}, c={b={}}, d={c={d={}}}}, c={d={c={}}}}
==>e={f={e={}, g={}, h={f={}}}, h={f={h={}}}}

Just to enhance @Alberto's version, which is working well: you can use the simplePath() traversal step (http://tinkerpop.apache.org/docs/current/reference/#simplepath-step) to ensure that the traverser does not repeat its path through the graph.
g.V().repeat(both().simplePath())
.until(bothE().count().is(lte(1)))
.dedup().tree().by('name').next()

Related

"Preserving" edge attributes with in built function

This post is related to a previous question.
The basic problem in that post was how to connect nodes by common retweeted users. The code suggested by @ThomasIsCoding does work.
My follow-up question is how to make this function store edge attributes. I provide a toy example below:
My initial dataframe is of the form:
author.id<-c("A","B","D")
rt_user<-c("C","C","C")
id<-c(1,2,3)
example<-data.frame(author.id,rt_user,id)
where nodes A, B, D retweet from node C and the id column is a numeric identifier for the tweets. Using this in a network structure and applying the aforementioned function leads to:
g <- graph.data.frame(example, directed = T)
edge.attributes(g)
gres <- do.call(
  graph.union,
  lapply(
    names(V(g))[degree(g, mode = "out") == 0],
    function(x) {
      nbs <- names(V(g))[distances(g, v = x, mode = "in") == 1]
      disjoint_union(
        set_vertex_attr(make_full_graph(length(nbs)), name = "name", value = nbs),
        set_vertex_attr(make_full_graph(1), name = "name", value = x)
      )
    }
  )
)
plot(gres)
edge.attributes(gres)
My goal is to "preserve" the edge attributes of g in the final transformation gres. I want to know that A used tweet 1, B used tweet 2, and D used tweet 3. I believe they should now be transformed from edge attributes to vertex attributes, so that one still knows who the user is and which tweet they used, but I am not sure how to incorporate this into the code.
You can try the code below
gres <- set_vertex_attr(gres,
  name = "id",
  value = E(g)$id[match(names(V(gres)), names(tail_of(g, E(g))))]
)
where the edge attributes are matched to their tail vertices via tail_of, and you will see
> V(gres)$id
[1] 1 2 3 NA
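To see why this works with the toy graph above, here is a small illustrative check (the printed values are what the toy example should produce):
names(tail_of(g, E(g)))                          # "A" "B" "D" -> source vertex of each edge
E(g)$id                                          # 1 2 3      -> tweet id stored on each edge
match(names(V(gres)), names(tail_of(g, E(g))))   # 1 2 3 NA   -> "C" never authored a tweet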

Problem with upset plot intersection numbers

I have four sets A, B, C and D like below:
A <- c("ENSG00000103472", "ENSG00000130600", "ENSG00000177335", "ENSG00000177337",
"ENSG00000178977", "ENSG00000180139", "ENSG00000180539", "ENSG00000187621",
"ENSG00000188511", "ENSG00000197099", "ENSG00000203446", "ENSG00000203739",
"ENSG00000203804", "ENSG00000204261", "ENSG00000204282", "ENSG00000204584",
"ENSG00000205056", "ENSG00000205837", "ENSG00000206337", "ENSG00000213057")
B <- c("ENSG00000146521", "ENSG00000165511", "ENSG00000174171", "ENSG00000176659",
"ENSG00000179428", "ENSG00000179840", "ENSG00000180539", "ENSG00000204261",
"ENSG00000204282", "ENSG00000204949", "ENSG00000206337", "ENSG00000223534",
"ENSG00000223552", "ENSG00000223725", "ENSG00000226252", "ENSG00000226751",
"ENSG00000226777", "ENSG00000227066", "ENSG00000227260", "ENSG00000227403")
C <- c("ENSG00000167912", "ENSG00000168405", "ENSG00000172965", "ENSG00000177234",
"ENSG00000177699", "ENSG00000177822", "ENSG00000179428", "ENSG00000179840",
"ENSG00000180139", "ENSG00000181800", "ENSG00000181908", "ENSG00000183674",
"ENSG00000189238", "ENSG00000196668", "ENSG00000196979", "ENSG00000197301",
"ENSG00000203446", "ENSG00000203999", "ENSG00000204261", "ENSG00000206337")
D <- c("ENSG00000122043", "ENSG00000162888", "ENSG00000167912", "ENSG00000176320",
"ENSG00000177699", "ENSG00000179253", "ENSG00000179428", "ENSG00000179840",
"ENSG00000180539", "ENSG00000181800", "ENSG00000185433", "ENSG00000188511",
"ENSG00000189238", "ENSG00000197301", "ENSG00000205056", "ENSG00000205562",
"ENSG00000213279", "ENSG00000214922", "ENSG00000215533", "ENSG00000218018")
An upset plot gave me following result:
library(UpSetR)
mine <- list("A" = A,
             "B" = B,
             "C" = C,
             "D" = D)
upset(fromList(mine), keep.order = TRUE)
But I'm interested in looking at intersections between specific sets: A & B, A & C, A & D. So I did it like below:
upset(fromList(mine),
      intersections = list(list("A"), list("B"), list("C"), list("D"),
                           list("A", "B"), list("A", "C"), list("A", "D")),
      keep.order = TRUE)
But the overlap between A & B is 4, A & C is 4, and A & D is 3. Why does the above upset plot show the wrong numbers?
How can I make it show the correct intersection numbers? I don't want the intersection of all sets.
The numbers are correct! The issue is very specific and complex.
There are different ways to calculate set intersection size:
"distinct" mode
"intersect" mode
"union" mode
UpSetR uses the "distinct" mode.
The "intersect" mode may be what the user expects.
The ComplexHeatmap and ComplexUpset packages allow the user to choose which mode to use.
I found a really good explanation by Jakob Rosenthal here: https://github.com/hms-dbmi/UpSetR/issues/72, especially the graphic in that thread.
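If the plain pairwise overlaps are what you are after, here is a minimal sketch using the ComplexHeatmap package mentioned above and the sets A, B, C, D from the question, switching to the "intersect" mode:
library(ComplexHeatmap)
sets <- list(A = A, B = B, C = C, D = D)
m <- make_comb_mat(sets, mode = "intersect")   # "intersect" counts |A ∩ B| directly
comb_size(m)["1100"]   # |A ∩ B| -> 4 for the data in the question
comb_size(m)["1010"]   # |A ∩ C|
comb_size(m)["1001"]   # |A ∩ D|
UpSet(m)               # draw the UpSet plot with intersect-mode counts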

Using cpquery function for several pairs from dataset

I am a relative beginner in R and trying to figure out how to use the cpquery function from the bnlearn package for all edges of a DAG.
First of all, I created a bn object, a fitted network, and a table with all arc strengths.
library(bnlearn)
data(learning.test)
baynet = hc(learning.test)
fit = bn.fit(baynet, learning.test)
sttbl = arc.strength(x = baynet, data = learning.test)
Then I tried to create a new variable in sttbl dataset, which was the result of cpquery function.
sttbl = sttbl %>% mutate(prob = NA) %>% arrange(strength)
sttbl[1,4] = cpquery(fit, `A` == 1, `D` == 1)
It looks pretty good (especially on bigger data), but when I try to automate this process, I keep running into errors such as:
Error in sampling(fitted = fitted, event = event, evidence = evidence, :
logical vector for evidence is of length 1 instead of 10000.
In a perfect situation, I need to create a function that fills the generated prob variable of the sttbl dataset regardless of its size. I tried to do it with a for loop too, but stumbled over the error above again and again. Unfortunately, I deleted the failed attempts, but they were something like this:
for (i in 1:nrow(sttbl)) {
  j = sttbl[i,1]
  k = sttbl[i,2]
  sttbl[i,4] = cpquery(fit, fit$j %in% sttbl[i,1]==1, fit$k %in% sttbl[i,2]==1)
}
or this:
for (i in 1:nrow(sttbl)) {
  sttbl[i,4] = cpquery(fit, sttbl[i,1] == 1, sttbl[i,2] == 1)
}
Now I think I have misunderstood something in R or the bnlearn package.
Could you please tell me how to accomplish this task, filling the column with multiple cpquery calls? That would help me a lot with my research!
cpquery is quite difficult to work with programmatically. If you look at the examples in the help page you can see the author uses eval(parse(...)) to build the queries. I have added two approaches below, one using the methods from the help page and one using cpdist to draw samples and reweighting to get the probabilities.
Your example
library(bnlearn); library(dplyr)
data(learning.test)
baynet = hc(learning.test)
fit = bn.fit(baynet, learning.test)
sttbl = arc.strength(x = baynet, data = learning.test)
sttbl = sttbl %>% mutate(prob = NA) %>% arrange(strength)
This uses cpquery and the much maligned eval(parse(...)) -- this is the
approach the bnlearn author takes to do this programmatically in the ?cpquery examples. Anyway,
# You want the evidence and event to be the same; in your question it is `1`
# but for example using learning.test data we use 'a'
state = "\'a\'" # note if the states are character then these need to be quoted
event = paste(sttbl$from, "==", state)
evidence = paste(sttbl$to, "==", state)
# loop through using code similar to that found in `cpquery`
set.seed(1) # to make sampling reproducible
for(i in 1:nrow(sttbl)) {
  qtxt = paste("cpquery(fit, ", event[i], ", ", evidence[i], ",n=1e6", ")")
  sttbl$prob[i] = eval(parse(text=qtxt))
}
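For example, the first arc after arrange(strength) is A -> D (see the output further down), so the first generated query text looks like this:
event[1]     # "A == 'a'"
evidence[1]  # "D == 'a'"
# qtxt for i = 1 is then the string "cpquery(fit,  A == 'a' ,  D == 'a' ,n=1e6 )",
# which eval(parse(...)) runs as an ordinary cpquery call.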
I find it preferable to work with cpdist which is used to generate random samples conditional on some evidence. You can then use these samples to build up queries. If you use likelihood weighting (method="lw") it is slightly easier to do this programmatically (and without evil(parse(...))).
The evidence is added in a named list i.e. list(A='a').
# The following just gives a quick way to assign the same
# evidence state to all the evidence nodes.
evidence = setNames(replicate(nrow(sttbl), "a", simplify = FALSE), sttbl$to)
# Now loop though the queries
# As we are using likelihood weighting we need to reweight to get the probabilities
# (cpquery does this under the hood)
# Also note with this method that you could simulate from more than
# one variable (event) at a time if the evidence was the same.
for(i in 1:nrow(sttbl)) {
  temp = cpdist(fit, sttbl$from[i], evidence[i], method="lw")
  w = attr(temp, "weights")
  sttbl$prob2[i] = sum(w[temp == 'a']) / sum(w)
}
sttbl
# from to strength prob prob2
# 1 A D -1938.9499 0.6186238 0.6233387
# 2 A B -1153.8796 0.6050552 0.6133448
# 3 C D -823.7605 0.7027782 0.7067417
# 4 B E -720.8266 0.7332107 0.7328657
# 5 F E -549.2300 0.5850828 0.5895373

Translate from lpSolve to lpSolveAPI Package

The goal: Use current lpSolve code to create a new code using the lpSolveAPI package.
The background: I have been using lpSolve to find an optimal solution, for purposes of creating fantasy sports contest lineups, which maximizes the projected points (DK) of the players on the team versus the maximum allowed total salary (SALARY) - with a handful of other constraints to fit the rules of the contest. I have discovered in a few instances, however, lpSolve fails to find the most optimal solution. It seemingly overlooks the best points/dollar solution for some unknown reason and finds only the nth best solution instead. Unfortunately, I do not have an example of this as I had issues with my archive drive recently and lost quite a bit of data.
My research/ask: I have read other threads here that have had similar issues with lpSolve (like this one here). In those instances, lpSolveAPI was able to see the optimal solution when lpSolve could not. Not being familiar with lpSolveAPI, I am looking for assistance from someone familiar with both packages in converting my current code to instead take advantage of the lpSolveAPI package and eliminate lpSolve oversight going forward. I have tried but, for some reason, I keep getting lost in the translation.
My lpSolve code:
# count the number of unique teams and players
unique_teams = unique(slate_players$TEAM)
unique_players = unique(slate_players$PLAYERID)
# define the objective for the solver
obj = slate_players$DK
# create a constraint matrix for the solver
con = rbind(t(model.matrix(~ POS + 0, slate_players)),      #Positions
            t(model.matrix(~ PLAYERID + 0, slate_players)), #DupPlayers
            t(model.matrix(~ TEAM + 0, slate_players)),     #SameTeam
            rep(1, nrow(slate_players)),                    #TotPlayers
            slate_players$SALARY)                           #MaxSalary
# set the direction for each of the constraints
dir = c("==", #1B
        "==", #2B
        "==", #3B
        "==", #C
        "==", #OF
        "==", #SP
        "==", #SS
        rep('<=', length(unique_players)), #DupPlayers
        rep('<=', length(unique_teams)),   #SameTeam
        "==", #TotPlayers
        "<=") #MaxSalary
# set the limits for the right-hand side of the constraints
rhs = c(1, #1B
        1, #2B
        1, #3B
        1, #C
        3, #OF
        2, #SP
        1, #SS
        rep(1, length(unique_players)), #DupPlayers
        rep(5, length(unique_teams)),   #SameTeam
        10,    #TotPlayers
        50000) #MaxSalary
# find the optimal solution using the solver
result = lp("max", obj, con, dir, rhs, all.bin = TRUE)
# create a data frame for the players in the optimal solution
solindex = which(result$solution==1)
optsolution = slate_players[solindex,]
Thank you for your help!
This should be straightforward:
library(lpSolveAPI)
ncons <- nrow(con)
nvars <- length(obj)
lprec <- make.lp(nrow = ncons, ncol = nvars)
lp.control(lprec, sense = "max")   # lp("max", ...) maximized; make.lp minimizes by default
set.objfn(lprec, obj)
set.type(lprec, 1:nvars, "binary") # all.bin=TRUE
for (i in 1:ncons) {
  set.row(lprec, row = i, xt = con[i, ])
  # lpSolveAPI expects "=" where lpSolve also accepts "=="
  set.constr.type(lprec, sub("==", "=", dir[i]), constraints = i)
  set.rhs(lprec, b = rhs[i], constraints = i)
}
status <- solve(lprec)
if (status != 0) stop("no solution found, error code=", status)
sol <- get.variables(lprec)
This code is untested since your question has missing data references and no expected solution.
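If it helps to sanity-check the translation, here is a tiny self-contained sketch with made-up numbers (three hypothetical players, pick exactly two, salary cap of 100); the salaries and projected points are invented purely for illustration:
library(lpSolveAPI)
obj <- c(10, 8, 6)                       # projected points (DK)
con <- rbind(rep(1, 3),                  # row 1: number of players picked
             c(60, 50, 40))              # row 2: salaries
dir <- c("=", "<=")
rhs <- c(2, 100)
lprec <- make.lp(nrow = nrow(con), ncol = length(obj))
lp.control(lprec, sense = "max")         # maximise, as lp("max", ...) did
set.objfn(lprec, obj)
set.type(lprec, seq_along(obj), "binary")
for (i in 1:nrow(con)) {
  set.row(lprec, row = i, xt = con[i, ])
  set.constr.type(lprec, dir[i], constraints = i)
  set.rhs(lprec, b = rhs[i], constraints = i)
}
solve(lprec)            # 0 = optimal solution found
get.variables(lprec)    # 1 0 1 -> players 1 and 3 (players 1 and 2 would bust the cap)
get.objective(lprec)    # 16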

module 'tensorflow' has no attribute 'variable' in R

I am making a neural network in R so I can predict future data.
Firstly, I made a function that makes the layers:
add_layer <- function(x, in_size, out_size, act_function){
  w = tf$variable(tf$random_normal(shape(in_size, out_size)))
  b = tf$variable(tf$random_normal(shape(1, out_size)))
  wxb = tf$matmul(x, w) + b
  y = act_function(wxb)
  return(y)
}
Then, I create the layers. For now, I create 2 layers:
x = tf$placeholder(tf$float32, shape(NULL,31))
ty = tf$placeholder(tf$float32, shape(NULL, 2))
#First layer
l1 = add_layer(x, 31, 10, tf$nn$relu)
#Second layer, result is 0(false) or 1(true)
l = add_layer(l1, 10,2, tf$nn$sotfmax)
But then there is an error when I make layer l1 and layer l:
AttributeError: module 'tensorflow' has no attribute 'variable'
The problem is, when I remove in_size or out_size, it gives me the error that these are missing. If I add these two, it gives me this error instead. After filling in all the parameters (in_size, out_size, x and the activation function) it still gives me the 'variable' error shown above.
Any suggestions how to solve this?
Edit: I changed the capitalization of the letter v, but the result is still the same.
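For what it's worth, the R tensorflow package exposes the Python attribute names verbatim, and the Python class is tf.Variable with a capital V, so every occurrence of tf$variable would need to be tf$Variable. Below is a minimal sketch of the layer function with the capitalization fixed, assuming the TF 1.x-style API (tf$placeholder, tf$random_normal) used in the question:
library(tensorflow)

add_layer <- function(x, in_size, out_size, act_function) {
  w <- tf$Variable(tf$random_normal(shape(in_size, out_size)))  # capital V
  b <- tf$Variable(tf$random_normal(shape(1, out_size)))
  wxb <- tf$matmul(x, w) + b
  act_function(wxb)
}

x  <- tf$placeholder(tf$float32, shape(NULL, 31))
l1 <- add_layer(x, 31, 10, tf$nn$relu)
l  <- add_layer(l1, 10, 2, tf$nn$softmax)  # note: softmax, not sotfmax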
