how to connect directional sub-graphs in R igraph - r

I have a directional weighted graph that is made of two or more disconnected sub-graphs (with some attributes, in addition to weight):
require(igraph)
df<-data.frame(from=c(1,2,4,5),to=c(2,3,5,6),weight=c(1,1,1,1),attr=c(0.1,0.1,0.1,0.1))
g<-graph_from_data_frame(df,directed=T)
My ultimate goal is to find shortest path, but this can be done only for connected graphs.
As a result, I need to connect these two sub-graphs with an edge between 3 and 4 with the large weight (perhaps vcount(g)) so at the end I have just one graph. In general, vertex names are dates that define direction (from smaller to larger). More than one gap can be present.

You can try the code below if you have more than one gap (i.e., more than two clusters)
e <- c(sapply(decompose(g),function(x) names(V(x))[degree(x)==1]))
G <- g %>%
add.edges(e[2:(length(e)-1)],weight = vcount(g))
such that
> get.data.frame(G)
from to weight attr
1 1 2 1 0.1
2 2 3 1 0.1
3 4 5 1 0.1
4 5 6 1 0.1
5 7 8 1 0.1
6 8 9 1 0.1
7 3 4 9 NA
8 6 7 9 NA
Data
df <-
data.frame(
from = c(1, 2, 4, 5, 7, 8),
to = c(2, 3, 5, 6, 8, 9),
weight = c(1, 1, 1, 1, 1, 1),
attr = c(0.1, 0.1, 0.1, 0.1, 0.1, 0.1)
)

In your case, you could do:
(Thanks for the comment by #ThomasIsCoding
h <- add.edges(g, c("3","4"), weight = vcount(g))

Related

Back tracing parents/paths of two-column data of a tree

I have a tree data serialized like the following:
Relationship: P to C is "one-to-many", and C to P is "one-to-one". So column P may have duplicate values, but column C has unique values.
P, C
1, 2
1, 3
3, 4
2, 5
4, 6
# in data.frame
df <- data.frame(P=c(1,1,3,2,4), C=c(2,3,4,5,6))
1. How do I efficiently implement a function func so that:
func(df, val) returns a vector of full path to root (1 in this case).
For example:
func(df, 3) returns c(1,2,3)
func(df, 5) returns c(1,2,5)
func(df, 6) returns c(1,3,4,6)
2. Alternatively, quickly transforming df to a lookup table like this also works for me:
C, Paths
2, c(1,2)
3, c(1,3)
4, c(1,3,4)
5, c(1,2,5)
6, c(1,2,4,6)
Here is a solution using igraph
library(igraph)
g <- graph_from_data_frame(df)
df <- within(df,
Path <- sapply(match(as.character(C),names(V(g))),
function(k) toString(names(unlist(all_simple_paths(g,1,k))))))
such that
> df
P C Path
1 1 2 1, 2
2 1 3 1, 3
3 3 4 1, 3, 4
4 2 5 1, 2, 5
5 4 6 1, 3, 4, 6

index from one vector to another by closest values

Given two sorted vectors, how can you get the index of the closest values from one onto the other.
For example, given:
a = 1:20
b = seq(from=1, to=20, by=5)
how can I efficiently get the vector
c = (1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4)
which, for each value in a, provides the index of the largest value in b that is less than or equal to it. But the solution needs to work for unpredictable (though sorted) contents of a and b, and needs to be fast when a and b are large.
You can use findInterval, which constructs a sequence of intervals given by breakpoints in b and returns the interval indices in which the elements of a are located (see also ?findInterval for additional arguments, such as behavior at interval boundaries).
a = 1:20
b = seq(from = 1, to = 20, by = 5)
findInterval(a, b)
#> [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
We can use cut
as.integer(cut(a, breaks = unique(c(b-1, Inf)), labels = seq_along(b)))
#[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

Is there a function to know how many times a column has the best value?

I have a data.frame like this :
A B C
4 8 2
1 3 5
5 7 6
It could have more column and lines.
So what I'd like to know is for each column how many times they have the lowest values (in my example the result should be 2 for A and 1 for C).
d = data.frame(a = c(4, 1, 5), b = c(8, 3, 7), c = c(2, 5, 6))
row_mins = apply(d, 1, min)
# alternately, slightly more efficient
row_mins = do.call(pmin, d)
colSums(d == row_mins)
# a b c
# 2 0 1

Error in isoMDS(d): zero or negative distance between objects

I'm trying to do a nonmetric MDS (R version 3.3.3) using the isoMDS function in the MASS package and I get this error:
Error in isoMDS(d): zero or negative distance between objects 1 and 2
Here's an example of what I'm doing:
# LOAD LIBRARY
library(MASS)
# CREATE FAKE DATA
a <- c(1, 1, 1, 1)
b <- c(2, 2, 2, 2)
c <- c(3, 3, 4, 5)
d <- c(4, 4, 7, 9)
x <- data.frame(a, b, c, d)
x
a b c d
1 1 2 3 4
2 1 2 3 4
3 1 2 4 7
4 1 2 5 9
# EUCLIDEAN DISTANCE BETWEEN ROWS 1, 2, 3 and 4
d <- dist(x)
d
1 2 3
2 0.000000
3 3.162278 3.162278
4 5.385165 5.385165 2.236068
# NMDS
fit <- isoMDS(d)
Error in isoMDS(d) : distance négative ou nulle entre les objets 1 et 2
I don't know if there's a way of getting around this issue or if I'm doing something wrong. I understand that objects 1 and 2 are identical and that that's probably why the distance is negative or equals to zero. I found out that my question was a "FAQ", but one of the only answers I found is this:
Short answer: you cannot compare distances including NAs, so there is no
way to find a monotone mapping of distances. If the data really are identical for two rows, you can easily drop one of
them whilst doing MDS, and then assign the position found for one to the
other.
So, my next questions are: how do you drop rows whilst doing MDS, and is there any other way to perform a NMDS?
Any help would be greatly appreciated!
The dist function computes the distances between the rows of a data matrix.
Your a, b, c, and d vectors are the columns of the x matrix, not the rows.
A simple solution is to transpose x:
library(MASS)
a <- c(1, 1, 1, 1)
b <- c(2, 2, 2, 2)
c <- c(3, 3, 4, 5)
d <- c(4, 4, 7, 9)
x <- data.frame(a, b, c, d)
# Calculate distance between the columns
d <- dist(t(x))
# NMDS
fit <- isoMDS(d)
# initial value 0.000000
# final value 0.000000
# converged
fit
# $points
# [,1] [,2]
# a -4.594429 0.4509513
# b -2.770312 -0.3638885
# c 1.098884 -0.3114594
# d 6.265857 0.2243966
#
# $stress
# [1] 7.976932e-15
I hope it can help you.
As you noted, you have identical rows.
You can omit identical rows when you first create the distance matrix
d <- dist(x[-1,])
Then continue as normal
fit <- isoMDS(d)
Alternatively, you could try the vegan::metaMDS function:
library(vegan)
#> This is vegan 2.5-3
x <- data.frame(a = c(1, 1, 1, 1),
b = c(2, 2, 2, 2),
c = c(3, 3, 4, 5),
d = c(4, 4, 7, 9))
# The warnings are expected for such a small dataset
fit <- vegan::metaMDS(comm = dist(x))
#> ... Procrustes: rmse 0.09543314 max resid 0.108719
#> *** No convergence -- monoMDS stopping criteria:
#> 17: stress < smin
#> 3: scale factor of the gradient < sfgrmin
#> Warning in vegan::metaMDS(comm = dist(x)): stress is (nearly) zero: you may
#> have insufficient data
ordiplot(fit, type = "text")
Variables/columns "a" and "b" (1 and 2) get the same coordinates.
Similarly, using the smacof::mds function:
library(smacof)
fit2 <- smacof::mds(delta = dist(x), type = "ordinal")
fit2$conf
#> D1 D2
#> 1 0.5742535 0.007220978 # 1 & 2 get the same coordinates
#> 2 0.5742535 0.007220978
#> 3 -0.2749314 -0.034928060
#> 4 -0.8735757 0.020486105

calculating simple retention in R

For the dataset test, my objective is to find out how many unique users carried over from one period to the next on a period-by-period basis.
> test
user_id period
1 1 1
2 5 1
3 1 1
4 3 1
5 4 1
6 2 2
7 3 2
8 2 2
9 3 2
10 1 2
11 5 3
12 5 3
13 2 3
14 1 3
15 4 3
16 5 4
17 5 4
18 5 4
19 4 4
20 3 4
For example, in the first period there were four unique users (1, 3, 4, and 5), two of which were active in the second period. Therefore the retention rate would be 0.5. In the second period there were three unique users, two of which were active in the third period, and so the retention rate would be 0.666, and so on. How would one find the percentage of unique users that are active in the following period? Any suggestions would be appreciated.
The output would be the following:
> output
period retention
1 1 NA
2 2 0.500
3 3 0.666
4 4 0.500
The test data:
> dput(test)
structure(list(user_id = c(1, 5, 1, 3, 4, 2, 3, 2, 3, 1, 5, 5,
2, 1, 4, 5, 5, 5, 4, 3), period = c(1, 1, 1, 1, 1, 2, 2, 2, 2,
2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4)), .Names = c("user_id", "period"
), row.names = c(NA, -20L), class = "data.frame")
How about this? First split the users by period, then write a function that calculates the proportion carryover between any two periods, then loop it through the split list with mapply.
splt <- split(test$user_id, test$period)
carryover <- function(x, y) {
length(unique(intersect(x, y))) / length(unique(x))
}
mapply(carryover, splt[1:(length(splt) - 1)], splt[2:length(splt)])
1 2 3
0.5000000 0.6666667 0.5000000
Here is an attempt using dplyr, though it also uses some standard syntax in the summarise:
test %>%
group_by(period) %>%
summarise(retention=length(intersect(user_id,test$user_id[test$period==(period+1)]))/n_distinct(user_id)) %>%
mutate(retention=lag(retention))
This returns:
period retention
<dbl> <dbl>
1 1 NA
2 2 0.5000000
3 3 0.6666667
4 4 0.5000000
This isn't so elegant but it seems to work. Assuming df is the data frame:
# make a list to hold unique IDS by
uniques = list()
for(i in 1:max(df$period)){
uniques[[i]] = unique(df$user_id[df$period == i])
}
# hold the retention rates
retentions = rep(NA, times = max(df$period))
for(j in 2:max(df$period)){
retentions[j] = mean(uniques[[j-1]] %in% uniques[[j]])
}
Basically the %in% creates a logical of whether or not each element of the first argument is in the second. Taking a mean gives us the proportion.

Resources