Graph learning in R, igraph, tidygraph

I have a graph with each node having a value (value in red).
I would like to do the following two things (I guess 1 is a special case of 2):
Each node should be assigned the mean of the values of the direct peers directing to it. For example, node #5: (1+2)/2 = 1.5, or node #3: (0+2+0)/3 = 2/3.
Instead of only the direct neighbors, include all connected nodes, but weight each one's value by 1/n, where n is the distance to the node: the farther away the information comes from, the weaker the signal.
I looked into the functions of igraph but could not find anything that does this (I might have overlooked it, though). How could I do this computation?
Below is the code for a sample network with random values.
library(tidyverse)
library(tidygraph)
library(ggraph)
set.seed(6)
q <- tidygraph::play_erdos_renyi(6, p = 0.2) %>%
  mutate(id = row_number(),
         value = sample(0:3, size = 6, replace = TRUE))

q %>%
  ggraph(layout = "with_fr") +
  geom_edge_link(arrow = arrow(length = unit(0.2, "inches"),
                               type = "closed")) +
  geom_node_label(aes(label = id)) +
  geom_node_text(aes(label = value), color = "red", size = 7,
                 nudge_x = 0.2, nudge_y = 0.2)
Edit, found a solution to 1
q %>%
  mutate(value_smooth = map_local_dbl(order = 1, mindist = 1, mode = "in",
                                      .f = function(neighborhood, ...) {
    mean(as_tibble(neighborhood, active = 'nodes')$value)
  }))
Edit 2, solution to 2, not the most elegant I guess
q %>%
  mutate(value_smooth = map_local_dbl(order = 1, mindist = 0, mode = "in",
                                      # for problem 2, increase `order` so the
                                      # neighbourhood reaches all connected nodes
                                      .f = function(neighborhood, node, ...) {
    ne <- neighborhood
    ne <- ne %>%
      mutate(d = node_distance_to(which(as_tibble(ne, active = "nodes")$id == node)))
    as_tibble(ne, active = 'nodes') %>%
      filter(d != 0) %>%
      mutate(helper = value / d) %>%
      summarise(m = mean(helper)) %>%   # average the distance-weighted values
      pull(m)
  }))
Edit 3, a faster alternative to map_local_dbl
map_local loops through all nodes of the graph. For large graphs, this takes a long time. For just computing the means, this is not needed: a much faster alternative is to use the adjacency matrix and some matrix multiplication.
q_adj <- q %>%
  igraph::as_adjacency_matrix()
# mean over the nodes each node points to ("out")
(q_adj %*% as_tibble(q)$value) / Matrix::rowSums(q_adj)
# mean over the nodes pointing to each node ("in", i.e. problem 1)
(t(q_adj) %*% as_tibble(q)$value) / Matrix::colSums(q_adj)
The square of the adjacency matrix is the second order adjacency matrix, and so forth. So a solution to problem 2 could also be created.
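For instance, here is a rough sketch along those lines (my own illustration, not part of the original post): successive powers of the adjacency matrix are used to recover shortest-path distances, which are then turned into 1/d weights. It assumes the unweighted graph q from above; the variable names (vals, W, walks, seen) are made up for the example.
vals <- as_tibble(q)$value
A <- igraph::as_adjacency_matrix(q, sparse = FALSE)  # dense 0/1 adjacency matrix
n <- nrow(A)

# W[j, i] = 1 / d(j, i), where d is the length of the shortest directed path from j to i
W <- matrix(0, n, n)
walks <- A                     # number of walks of length k (start with k = 1)
seen  <- diag(n) > 0           # distances already assigned (each node to itself)
for (k in 1:(n - 1)) {
  new <- (walks > 0) & !seen   # first reached at distance exactly k
  W[new] <- 1 / k
  seen <- seen | new
  walks <- walks %*% A         # next power of A: walks of length k + 1
}

# For each node i, average value[j] / d(j, i) over all nodes j that reach i
# (the "in" direction with value/d weights, as in Edit 2); nodes with no
# ancestors give NaN
colSums(W * vals) / colSums(W > 0)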
Edit 4, direct weighted mean
Say the original graph has weights associated with each edge.
q <- q %>%
  activate(edges) %>%
  mutate(w = c(1, 0.5, 1, 0.5, 1, 0.5, 1)) %>%
  activate(nodes)
We would like to compute the weighted mean of the direct peers' values.
q_adj_wgt <- q %>%
  igraph::as_adjacency_matrix(attr = "w")
# weighted mean over the nodes each node points to ("out")
(q_adj_wgt %*% as_tibble(q)$value) / Matrix::rowSums(q_adj_wgt)
# weighted mean over the nodes pointing to each node ("in")
(t(q_adj_wgt) %*% as_tibble(q)$value) / Matrix::colSums(q_adj_wgt)

You could try the code below:
library(igraph)

q %>%
  set_vertex_attr(
    name = "value",
    value = sapply(
      ego(., mode = "in", mindist = 1),
      function(x) mean(x$value)
    )
  )
which gives (nodes with no incoming edges get NaN):
# A tbl_graph: 6 nodes and 7 edges
#
# A directed simple graph with 1 component
#
# Node Data: 6 x 2 (active)
id value
<int> <dbl>
1 1 0.5
2 2 NaN
3 3 0.667
4 4 NaN
5 5 1.5
6 6 NaN
#
# Edge Data: 7 x 2
from to
<int> <int>
1 3 1
2 6 1
3 1 3
# ... with 4 more rows

Each node should be assigned the mean of the value of the direct peers directing to it.
Guessing that you really mean
Each node should be assigned the mean of the values of the direct peers directing to it, before any node values were changed
This seems trivial, so maybe I am missing something? The steps would be (a short sketch follows below):
1. Loop over the nodes.
2. Sum the values of each node's adjacent nodes.
3. Calculate the mean and store it in a vector indexed by node.
4. Loop over the nodes again.
5. Set each node's value to the mean stored in the first loop.
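A minimal sketch of those two passes in plain igraph (my own illustration; it assumes the graph q and the value node attribute from the question, and the helper name new_value is made up):
library(igraph)

# Pass 1: for every node, take the mean of its in-neighbours' current values
new_value <- sapply(seq_len(gorder(q)), function(v) {
  peers <- neighbors(q, v, mode = "in")   # nodes directing to v
  mean(vertex_attr(q, "value", peers))    # NaN if v has no incoming peers
})

# Pass 2: only now overwrite the values, so every mean above was computed
# from the original, unchanged node values
q <- set_vertex_attr(q, "value", value = new_value)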

Related

Need help plotting analytical solution of phytoplankton resource competition model in R

I'm working on a one-species, two-resource phytoplankton competition model based on Tilman's work in the 70s and 80s. I have a data frame set up for the analytical solution but am really struggling with the syntax to plot the graphs I need. Here is my code so far:
library(dplyr)
r <- 0.1
g1 <- 0.001
g2 <- 0.01
v1 <- 0.1
v2 <- 1
k1 <- 0.01
k2 <- 0.1
d <- 0.15
s1_star = (r*g1*k1*d)-((v1*(r-d))-r*g1*d)
s2_star = (r*g2*k2*d)-((v2*(r-d))-r*g2*d)
s01 = s1_star+((s02-s2_star)*(g1/g2))
params <- list(r = 0.1,
               g1 = 0.001,
               g2 = 0.01,
               d = 0.5,
               v1 = 0.1,
               v2 = 1,
               k1 = 0.01,
               k2 = 0.1)

df <- data.frame(s02 = seq(10, 1, -1)) |>
  mutate(
    s1_star = (r*g1*k1*d)-((v1*(r-d))-r*g1*d),
    s2_star = (r*g2*k2*d)-((v2*(r-d))-r*g2*d),
    s01 = s1_star+((s02-s2_star)*(g1/g2)), ## Tilman eq 17, supply concentration of resource 1
    ## in the reservoir that would result in co-limitation given some concentration of
    ## resource 2 (s20) in the reservoir
    s1_limiting_ratio = s02/s01 ## ratio of supply points that result in co-limitation
  )
cbind(params, df) |> as.data.frame() -> limiting_ratio
library(ggplot2)
limiting_ratio |> ggplot(aes(x = s1_star, y = s2_star)) + geom_line()
I want to plot s1_star and s2_star as the axes (which I did), but I'm trying to add s1_limiting_ratio as a line on the graph (it's the ratio s02/s01, which represents when resource 1 (S1) and resource 2 (S2) are co-limited). Then I want to plot various values of s01 and s02 on the graph to see where they fall, to determine which resource is limiting and therefore which resource equation (S1 or S2) to use in the analytical solution.
I've tried googling for ggplot help, but I'm struggling to apply it to the graph I need. I'm still fairly new to R and definitely pretty new to ggplot, so I really appreciate any help and advice!

optimize R code for min() and sample() by group

I generate a network with npeople (= 80) people and ncomp (= 4) components, and I want each component to have a density equal to dens (= 0.2).
I want to optimize the two lines of the code which take most of the time (especially if I want to have 5k people in the network).
The two lines are:
# adjust probability to keep density
nodes[,p:= as.numeric(min(c(1, p * (1/(mean(nodes$p) / c.dens))))), by = c("ID","ALTERID")]
# simulate edges
nodes[, edge := sample(c(0,1),1, prob = c(1-p,p)), by = c("ID","ALTERID")]
I have tried using the lapply() function, but the execution time increased; see the line of code below:
nodes[,lapply(.SD, function(p) min(c(1, p * (1/(mean(nodes$p) / c.dens))))), by = c("ID","ALTERID")]
rm(list=ls())
library(data.table)
library(intergraph)
library(igraph)
library(Matrix)
library(profvis)
library(ggplot2)
draw.var <- function(n, var1, rho, mean){
  C <- matrix(rho, nrow = 2, ncol = 2)
  diag(C) <- 1
  C <- chol(C)
  S <- rnorm(n, mean = mean)
  S <- cbind(scale(var1)[1:n], S)
  ZS <- S %*% C
  return(ZS[, 2])
}
set.seed(1123)
profvis({
# create empty list to store data
dt.list <- list()
npeople <- 500
dens <- .2
OC.impact <- FALSE
cor_iv_si <- .6
cor_iv_uc <- 0
cor_uc_oc <- 0.6
ncomp <- 4
beta_oc <- 2 # observed characteristics
beta_uc <- 2 # unobserved characteristics
beta_si <- 1
# create data.table
dt.people <- data.table(ego = 1:npeople)
# draw observed characteristics
dt.people[, OC := abs(rt(npeople,2))]
# draw unobserved variable
dt.people[, UC := draw.var(npeople, dt.people$OC, rho = cor_uc_oc,mean = 5)]
# set component identifier
dt.people$group <- cut_number(dt.people$UC, ncomp,labels = F)
for(q in 1:ncomp){
# subset comp
dt.sub <- dt.people[group == q]
# create undirected graph
nodes <- as.data.table(t(combn(dt.sub$ego, 2)))
setnames(nodes,c("ID","ALTERID"))
# add attributes
nodes <- merge(nodes,dt.people[,list(ID = ego, ID.UC = UC, ID.OC = OC)], by = "ID")
nodes <- merge(nodes,dt.people[,list(ALTERID = ego, ALTERID.UC = UC, ALTERID.OC = OC)], by = "ALTERID")
# calculate distance
nodes[,d := abs(ID.UC - ALTERID.UC)]
# estimate the appropriate density per component
n.edges <- (dens * (npeople * (npeople - 1)))/ncomp
n.nodes <- npeople/ncomp
c.dens <- n.edges/(n.nodes * (n.nodes - 1))
# estimate initial probability of tie based on distance
coefficient <- log(c.dens / (1 - c.dens))
alpha <- coefficient / mean(nodes$d)
nodes[,p := exp(alpha * d) / (1 + exp(alpha * d))]
# adjust probability to keep density
nodes[,p:= as.numeric(min(c(1, p * (1/(mean(nodes$p) / c.dens))))), by = c("ID","ALTERID")]
# simulate edges
nodes[, edge := sample(c(0,1),1, prob = c(1-p,p)), by = c("ID","ALTERID")]
# keep the edges
nodes <- nodes[edge == 1,list(ID,ALTERID)]
# bind the networks
if(q == 1){
net <- copy(nodes)
} else{
net <- rbind(net,nodes)
}
}
# create the opposite direction
net <- rbind(net,net[,list(ID = ALTERID, ALTERID = ID)])
})
This incorporates @BenBolker's and @DavidArenburg's suggestions and also makes use of some of data.table's tools.
Non-Equi joins
The OP code loops through each group. One part of the code also uses combn and multiple joins to get the data in the right format. Using non-equi joins, we can combine all of those steps in one data.table call
dt_non_sub <- dt.people[dt.people,
                        on = .(ego < ego, group = group),
                        allow.cartesian = TRUE,
                        nomatch = 0L,
                        .(group,
                          ALTERID = i.ego, ID = x.ego,
                          ID.UC = UC, ID.OC = OC,
                          ALTERID.OC = i.OC, ALTERID.UC = i.UC,
                          d = abs(UC - i.UC))  # added to be more efficient
]
# dt_non_sub[, d:= abs(ID.UC - ALTERID.UC)]
Vectorization
The original code was slow mostly because of the two calls with by groupings. Since each call split the data frame into around 8,000 individual groups, there were 8,000 function calls each time. The version below eliminates those by using pmin, as suggested by @DavidArenburg, and then uses runif(.N) < p, as suggested by @BenBolker. My addition: since your final result doesn't seem to care about p, I only assign edge, using {} so that only the last expression calculated in the call is returned.
# alpha <- coefficient / mean(nodes$d)
dt_non_sub[,
           edge := {
             alpha = coefficient / mean(d)
             p = exp(alpha * d) / (1 + exp(alpha * d))
             p_mean = mean(p)
             p = pmin(1, p * (1 / (p_mean / c.dens)))
             as.numeric(runif(.N) < p)
           },
           by = .(group)]
net2 <- rbindlist(list(
  dt_non_sub[edge == 1, .(group, ALTERID, ID)],
  dt_non_sub[edge == 1, .(group, ID = ALTERID, ALTERID = ID)]
))
One thing to note is that the vectorization is not 100% identical: your code was effectively recursive, as each split updated mean(nodes$p) for the next ID, ALTERID group. If you need that recursive part of the call, there is not much that can be done to make it faster.
In the end, the modified code runs in 20 ms vs. the 810 ms of your original function. The results, while different, are somewhat similar in the total number of results:
Original:
net
ID ALTERID
1: 5 10
2: 10 14
3: 5 25
4: 10 25
5: 14 25
---
48646: 498 458
48647: 498 477
48648: 498 486
48649: 498 487
48650: 498 493
Modified
net2
group ALTERID ID
1: 2 4 3
2: 2 6 4
3: 4 7 1
4: 4 8 7
5: 2 9 4
---
49512: 3 460 500
49513: 3 465 500
49514: 3 478 500
49515: 3 482 500
49516: 3 497 500

Building a life table in R with for loops

I'm new to R and programming in general, and I'm struggling with a for-loop for building the lx function in a life table.
I have the age x, the death function qx (the probability that someone aged exactly x will die before reaching age x+1), and the survival function px = 1 - qx.
I want to write a function that returns a vector with all the lx values from first to last age in my table. The function is simple...
I've defined cohort = 1000000. The first age in my table is x = 5, so, considering x = 5...
l_(x) = cohort
And, from now on, l_(x+n) = l_(x+n-1)*p_(x+n-1)
I've read up on for-loops, but I can only get my code working for lx[1] and lx[2]; I get nothing for lx[n] if n > 2.
I wrote this function:
living_x <- function(px, cohort){
  result <- vector("double", length(px))
  l_x <- vector("double", length(px))
  for (i in 1:length(px)){
    if (i == 1){
      l_x[i] = cohort
    }
    else l_x[i] = l_x[i-1]*px[i-1]
    result[i] = l_x
    print(result)
  }
}
When I run it, I get several outputs (more than length(px)) and "There were 50 or more warnings (use warnings() to see the first 50)".
When I run warnings(), I get "In result[i] <- l_x : number of items to replace is not a multiple of replacement length" for every number.
Also, everything else I try gives me different errors or only calculates lx for lx[1] and lx[2]. I know there's something really wrong with my code, but I still couldn't identify it. I'd be glad if someone could give me a hint to find out what to change.
Thank you!
Here's an approach using dplyr from the tidyverse packages to calculate lx from px. Much the same can be done without mutate() via excerpt$lx = 100000 * cumprod(1 - lag(excerpt$qx_given, default = 0)) (note that lag() here is dplyr's vector lag, and the default is needed so the first term isn't NA).
lx is provided in the babynames package, so we can check our work:
library(tidyverse)
library(babynames)
# Get excerpt with age, qx, and lx.
excerpt <- lifetables %>%
  filter(year == 2010, sex == "F") %>%
  select(x, qx_given = qx, lx_given = lx)
excerpt
# A tibble: 120 x 3
x qx_given lx_given
<dbl> <dbl> <dbl>
1 0 0.00495 100000
2 1 0.00035 99505
3 2 0.00022 99471
4 3 0.00016 99449
5 4 0.00012 99433
6 5 0.00011 99421
7 6 0.00011 99410
8 7 0.0001 99399
9 8 0.0001 99389
10 9 0.00009 99379
# ... with 110 more rows
Using that data to estimate lx_calc:
est_lx <- excerpt %>%
  mutate(px = 1 - qx_given,
         cuml_px = cumprod(lag(px, default = 1)),
         lx_calc = cuml_px * 100000)
And finally, compare the given lx visually with the one calculated from px; they match exactly.
est_lx %>%
  gather(version, val, c(lx_given, lx_calc)) %>%
  ggplot(aes(x, val, color = version)) + geom_line()
I managed to do it in a very simple way after thinking about it for a few more minutes.
lx <- numeric(length(px))  # preallocate the result
lx[1] <- 10**6
for (i in 2:length(px)){
  lx[i] <- lx[i-1]*px[i-1]
}

Lift curve is swapped

For the lift curve example, I run
library(caret)
set.seed(1)
simulated <- data.frame(obs = factor(rep(letters[1:2], each = 100)),
                        perfect = sort(runif(200), decreasing = TRUE),
                        random = runif(200))
lift2 <- lift(obs ~ random + perfect, data = simulated)
xyplot(lift2, plot = "lift", auto.key = list(columns = 2))
and get
as the result. I expected the image to be mirrored horizontally, something along the lines of
What am I doing wrong?
By the way: this is a lift chart, not a cumulative gains chart.
Update:
The plot that I expected, produced now by my own code
mylift <- caret::lift(Class ~ cforest_prob + perfect_prob + guess_prob, data = data_test)
ggplot(mylift$data) +
  geom_line(aes(CumTestedPct, lift, color = liftModelVar))
is
I noticed that the data frame mylift$data contains the following columns:
names(mylift$data)
[1] "liftModelVar" "cuts" "events" "n" "Sn" "Sp" "EventPct"
[8] "CumEventPct" "lift" "CumTestedPct"
So I printed the following plot
ggplot(mylift$data) +
  geom_line(aes(cuts, lift, color = liftModelVar))
So I guess that the different plots are just different ways of examining lift? I wasn't aware that there are different lift charts; I thought they were standardized across the industry.
Edit by the question author, for late readers: I accepted this answer in large part because of the helpful discussion in the comments to this answer. Please consider reading the discussion!
Let's reproduce the graph and find the baseline. Let
cutoffs <- seq(0, 1, length = 1000)
be our cutoffs. Now the main computations are done by
aux <- sapply(cutoffs, function(ct) {
  perf <- simulated$obs[simulated$perfect > ct]
  rand <- simulated$obs[simulated$random > ct]
  c(mean(perf == "a"), mean(rand == "a"))
})
where we go over the vector of cutoffs and do the following. Take the perfect case. We say that whenever perfect > ct, we are going to predict "a". Then simulated$obs[simulated$perfect > ct] are the true values, while mean(perf == "a") is our accuracy with a given cutoff. The same happens with random.
As for the baseline, it is just a constant defined by the share of "a" in the sample:
baseline <- mean(simulated$obs == "a")
When plotting the lifts, we divide our accuracy by that of the baseline method and get the same graph along with the baseline curve:
plot(x = cutoffs, y = aux[1, ] / baseline, type = 'l', ylim = c(0, 2), xlab = "Cutoff", ylab = "Lift")
lines(x = cutoffs, y = aux[2, ] / baseline, col = 'blue')
abline(a = baseline / baseline, b = 0, col = 'magenta')
Update:
Here's an illustration that, at least when plotted manually, the lift curve of the "expected" type can be manipulated and gives non-unique results.
Your example graph is from here, which also has this data:
# contacted response
# 1 10000 6000
# 2 20000 10000
# 3 30000 13000
# 4 40000 15800
# 5 50000 17000
# 6 60000 18000
# 7 70000 18800
# 8 80000 19400
# 9 90000 19800
# 10 100000 20000
Now suppose that we do not know this cumulative evolution, but only 10 individual blocks:
# contacted response
# 1 10000 6000
# 2 10000 4000
# 3 10000 3000
# 4 10000 2800
# 5 10000 1200
# 6 10000 1000
# 7 10000 800
# 8 10000 600
# 9 10000 400
# 10 10000 200
In that case it depends on how we order the observations when putting "% Contacted" in the x-axis:
set.seed(1)
# build df from the 10 individual blocks listed above
df <- data.frame(contacted = rep(10000, 10),
                 response = c(6000, 4000, 3000, 2800, 1200, 1000, 800, 600, 400, 200))
baseline <- sum(df$response) / sum(df$contacted) * cumsum(df$contacted)
lift1 <- cumsum(df$response)
lift2 <- cumsum(sample(df$response))
x <- 1:10 * 10
plot(x = x, y = lift1 / baseline, col = 'red', type = 'l', ylim = c(0, 3),
     xlab = "% Customers contacted", ylab = "Lift")
lines(x = x, y = lift2 / baseline, col = 'blue')
abline(a = baseline / baseline, b = 0, col = 'magenta')

Using approx() with groups in dplyr

I am trying to use approx() and dplyr to interpolate values in an existing array. My initial code looks like this ...
p <- c(1, 1, 1, 2, 2, 2)
q <- c(1, 2, 3, 1, 2, 3)
r <- c(1, 2, 3, 4, 5, 6)
Inputs <- data.frame(p, q, r)
new.inputs <- as.numeric(c(1.5, 2.5))

library(dplyr)

Interpolated <- Inputs %>%
  group_by(p) %>%
  arrange(p, q) %>%
  mutate(new.output = approx(x = q, y = r, xout = new.inputs)$y)
I expect to see 1.5, 2.5, 4.5, 5.5 but instead I get
Error: incompatible size (2), expecting 3 (the group size) or 1
Can anyone tell me where I am going wrong?
You can get the values you expect using dplyr. The error occurs because mutate() must return either one value or exactly as many values as the group has rows (three here), while approx() returns two; summarise() (since dplyr 1.0.0) accepts results with a different number of rows per group.
library(dplyr)

Inputs %>%
  group_by(p) %>%
  arrange(p, q, .by_group = TRUE) %>%
  summarise(new.outputs = approx(x = q, y = r, xout = new.inputs)$y)
# p new.outputs
# <dbl> <dbl>
# 1 1.5
# 1 2.5
# 2 4.5
# 2 5.5
You can also get the values you expect using the ddply function from plyr.
library(plyr)
# Output as coordinates
ddply(Inputs, .(p), summarise, new.output = paste(approx(
x = q, y = r, xout = new.inputs
)$y, collapse = ","))
# p new.output
# 1 1.5,2.5
# 2 4.5,5.5
#######################################
# Output as flattened per group p
ddply(Inputs,
.(p),
summarise,
new.output = approx(x = q, y = r, xout = new.inputs)$y)
# p new.output
# 1 1.5
# 1 2.5
# 2 4.5
# 2 5.5
