This may look simple, but I am not able to do it. I want to plot two types of nodes, small and big, using a cutoff. The values are stored in the vertex attribute degree.
Here is a small toy example:
library(igraph)
g1 <- graph(edges = c(1,2, 2,3, 3,1, 4,2), n = 4, directed = FALSE) %>%
  set_vertex_attr("names", value = LETTERS[1:4])
g1_degree <- degree(g1, mode = "total")
g1_degree
[1] 2 3 2 1
g1 <- set_vertex_attr(g1, "degree", value = g1_degree)
plot(g1, vertex.size=V(g1)$degree)
This sizes every node according to its degree, but I want the nodes of degree 2 and 3 to be big and the node of degree 1 to be small.
So I tried to edit the values within V(g1)$degree
ifelse(V(g1)$degree < 2, yes = V(g1)$degree==1, no = V(g1)$degree==5)
FALSE FALSE FALSE TRUE
OK, so that checks my degree values against the cutoff, but how can I overwrite them with the actual sizes I need instead of TRUE or FALSE?
Here are two solutions.
One with ifelse, like in the question.
g1 <- set_vertex_attr(g1, "degree", value = ifelse(V(g1)$degree < 2, 1, 5))
V(g1)$degree
#[1] 5 5 5 1
And another with findInterval. This has better performance than ifelse, which can be important in large data sets.
i <- findInterval(V(g1)$degree, c(0, 2, Inf))
g1 <- set_vertex_attr(g1, "degree", value = c(1, 5)[i])
V(g1)$degree
#[1] 5 5 5 1
With different new sizes set via the findInterval index, c(10, 50)[i], the graph would look like the one below.
g1 <- set_vertex_attr(g1, "degree", value = c(10, 50)[i])
plot(g1, vertex.size = V(g1)$degree)
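If you prefer to keep the raw degrees in the attribute and only recode for display, the cutoff can also be applied directly inside the plot call; a small sketch:
# recode on the fly: degree 1 -> size 10, degree 2 or more -> size 50
plot(g1, vertex.size = ifelse(degree(g1) < 2, 10, 50))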
I am trying to create an edge list based on binary splits.
If I have a data frame that only contains the node number and some other metric, then I can manually create an edge list for the nodes. For example, if my data frame looks like this:
dfTest <- data.frame(
node = c(1,2,3,4,5),
var = c("milk", NA, "coffee", NA, NA),
isLeaf = c(F, T, F, T, T)
)
> dfTest
node var isLeaf
1 1 milk FALSE
2 2 <NA> TRUE
3 3 coffee FALSE
4 4 <NA> TRUE
5 5 <NA> TRUE
Then, based on the var or isLeaf column, I can manually create an edge list to connect the nodes. For example, as node 2 is a leaf, I know that node 1 must go to node 2. Then (as these are binary splits) I know node 1 must also connect to node 3. And as nodes 4 and 5 are leaf nodes, I know that they must split off from node 3.
Manually creating the edge list would look like this:
edges <- data.frame(
from = c(1, 1, 3, 3),
to = c(2, 3, 4, 5)
)
The to column is easy to find: it will always be 2:length(dfTest$node), in this case 2, 3, 4, 5. But the from column is proving difficult to find.
Just for a visual aid, the resulting tree would look like this:
Is there any way to do this without having to manually work out the edges?
EDIT:
In response to an answer, I'm adding a slightly larger dataset to use:
dfTest <- data.frame(
node = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11),
var = c("milk", "milk", NA, NA, "coffee", "sugar", NA, NA, "milk", NA, NA),
isLeaf = c(F, F, T, T, F, F, T, T, F, T, T)
)
A little explanation:
From the var column I know that milk (the root, node 1) splits to another milk (node 2). I can then see that node 2 splits to NA (node 3) and NA (node 4). As these are binary splits, I know that node 2 can't split any further, so I must go back to the previous node that only had one split, in this case node 1 (i.e., milk), which then splits to the right on coffee (node 5). Again, because the splits are binary, I know that coffee (node 5) must split to sugar (node 6). Sugar (node 6) is followed by two NAs (nodes 7 and 8). Now I must go back to coffee (node 5) and split to the right to get milk (node 9), which splits to two NAs (nodes 10 and 11).
The desired node/edge list should look like this:
edges <- data.frame(
from = c(1,2,2,1,5,6,6,5,9,9),
to = c(2,3,4,5,6,7,8,9,10,11)
)
Which would produce a tree like this:
As per your update, perhaps you can try the code below
# each non-leaf row starts a new group, so every group holds one internal node
# followed by the leaves that hang off it (possibly none)
grps <- split(dfTest, ~ cumsum(!isLeaf))
# within each group, connect the internal node to its leaf children
edges <- do.call(
  rbind,
  lapply(
    grps,
    function(x) {
      with(x, expand.grid(from = node[!isLeaf], to = node[isLeaf]))
    }
  )
)
# groups that consist of a single non-leaf row have no leaf children of their own:
# connect each one to the non-leaf node of the following two groups (its two internal children)
for (k in seq_along(grps)) {
  if (nrow(grps[[k]]) == 1) {
    lleaf <- with(grps[[k + 1]], node[!isLeaf])
    rleaf <- with(grps[[k + 2]], node[!isLeaf])
    edges <- rbind(edges, data.frame(from = grps[[k]]$node, to = c(lleaf, rleaf)))
  }
}
# sort by from/to and reset the row names
edges <- `row.names<-`(edges[with(edges, order(from, to)), ], NULL)
which gives
> edges
from to
1 1 2
2 1 5
3 2 3
4 2 4
5 5 6
6 5 9
7 6 7
8 6 8
9 9 10
10 9 11
Visualization
library(igraph)
graph_from_data_frame(edges) %>%
plot(layout = layout_as_tree)
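For comparison, the depth-first walk described in the question's explanation can also be coded directly with a small stack of "open" internal nodes; a sketch (build_edges is a hypothetical helper, not part of any package):
build_edges <- function(df) {
  from <- integer(0)
  to <- integer(0)
  stack <- df$node[1] # the root starts as the only open node
  children <- setNames(integer(nrow(df)), df$node) # children assigned so far
  for (i in 2:nrow(df)) {
    parent <- stack[length(stack)] # the current open internal node
    from <- c(from, parent)
    to <- c(to, df$node[i])
    children[as.character(parent)] <- children[as.character(parent)] + 1
    if (children[as.character(parent)] == 2) {
      stack <- stack[-length(stack)] # parent has both children now, close it
    }
    if (!df$isLeaf[i]) {
      stack <- c(stack, df$node[i]) # non-leaf nodes become the next open node
    }
  }
  data.frame(from = from, to = to)
}
build_edges(dfTest) # gives the same from/to pairs as the desired edge list above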
I have a large data set that I want to represent as a network graph using igraph; I just don't understand how to get the colors right. Is it possible to make an igraph plot where edges have the same color as the vertices? In my example below, I would like to color the vertices and edges according to the status 'sampled' or 'unsampled'. Another problem is that not all of the edges appear in the igraph plot, and I don't understand why.
My code so far is:
d <- data.frame(
  individual = 1:10,
  mother_id  = c(0, 0, 0, 0, 0, 1, 3, 7, 6, 7),
  father_id  = c(0, 0, 0, 0, 0, 4, 1, 6, 7, 6),
  generation = c(0, 0, 0, 0, 0, 1, 1, 2, 2, 2),
  status     = c("sampled", "unsampled", "unsampled", "sampled", "sampled",
                 "sampled", "unsampled", "unsampled", "sampled", "sampled")
)
#Just some settings for layout plot
g <- d$generation
n <- nrow(d)
pos <- matrix(data = NA, nrow = n, ncol = 2)
pos[, 2] <- max(g) - g
pos[, 1] <- order(g, partial = order(d$individual, decreasing = TRUE)) - cumsum(c(0, table(g)))[g + 1]
#Plotting the igraph
G <- graph_from_data_frame(d)
plot(G, rescale = T, vertex.label = d$individual, layout = pos,
edge.arrow.mode = "-",
vertex.color = d$status,
edge.color = d$status,
asp = 0.35)
My question is somewhat similar to this one, but I would like to do it with the igraph package:
Ggraph node color to match edge color
Thanks for your help.
If you plot(G) you will see that the graph_from_data_frame object is most likely not what you expect. That is why you don't see all the edges (i.e., the column father_id is not used at all).
By default igraph takes the first column as "from" and the second one as "to". That is why you see 1 to 0, 2 to 0, and so on.
You can fix this by passing in two objects, one with the edges and their attributes, and one with the nodes and their attributes.
It is not so clear to me where the edges should be. However, your code should look something like this:
dd <- read.table(text = "
from to type
1 6 A
3 7 B
7 8 A
6 9 B
7 10 A
4 6 B
1 7 A
6 8 B
7 9 B
6 10 A ", header=T )
nodes <- data.frame(id=unique(c(dd$from, dd$to)) )
nodes$type <- sample(LETTERS[1:2], 8, replace = TRUE) # random, so set.seed() first to reproduce the table below
nodes$x <- c(8,3,5,7,1,2,4,10) # this is for the layout
nodes$y <- c(1, 2, 4, 5, 6, 8, 5, 7)
nodes
id type x y
1 1 B 8 1
2 3 A 3 2
3 7 B 5 4
4 6 A 7 5
5 4 A 1 6
6 8 B 2 8
7 9 A 4 5
8 10 A 10 7
G <- graph_from_data_frame(dd, vertices = nodes ) # directed T or F?
V(G)$color <- ifelse( V(G)$type == "A", "pink", "skyblue")
E(G)$color <- ifelse( E(G)$type == "A", "pink", "skyblue")
edge_attr(G)
vertex_attr(G)
plot(G)
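Coming back to the columns in the original question, here is a sketch (edges_d, nodes_d and G2 are just illustration names) that builds an explicit edge list from mother_id / father_id and colors both vertices and edges by status:
# parent -> child edges, dropping the 0 placeholders that mean "no parent recorded"
edges_d <- rbind(
  data.frame(from = d$mother_id, to = d$individual),
  data.frame(from = d$father_id, to = d$individual)
)
edges_d <- edges_d[edges_d$from != 0, ]
# node table carrying the status attribute
nodes_d <- data.frame(id = d$individual, status = d$status)
G2 <- graph_from_data_frame(edges_d, vertices = nodes_d)
V(G2)$color <- ifelse(V(G2)$status == "sampled", "pink", "skyblue")
# one possible convention: color each edge like its child (the "to" end)
E(G2)$color <- V(G2)$color[ends(G2, E(G2), names = FALSE)[, 2]]
plot(G2, edge.arrow.mode = "-")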
Consider a clustering problem, where the true class labels are known (say g).
Suppose p denotes the predicted cluster labels (obtained by any clustering approach).
So, both g and p split the data set into groups, though the number of groups need not be the same in the two cases.
Between these two sets of groups, in some cases a group defined by g will contain exactly the same observations as a group defined by p, even though the labels may differ. I want to find the number of such groups, i.e., the number of cases where the clustering method detects a class perfectly.
I understand this is not a standard way to evaluate clustering (Rand index, Dunn index, etc. are recommended), but I am interested in it. I also understand that this number will be very small for most real-life data, maybe even 0, but the data set I am currently working with has a large number of classes (around 1500), with at most 15 observations in any one class. So, in this case, this number is likely to be quite high.
Here is a reproducible example and my attempt (working) at the solution:
# true labels
g <- c(1, 1, 2, 2, 2, 1, 3, 3, 3, 4)
# predicted labels
p <- c(3, 3, 1, 1, 1, 3, 4, 4, 1, 2)
# correctly detected groups
n_correct <- 2 # (1st class and 3rd cluster), (4th class and 2nd cluster)
# attempt
distinct_class_labels <- unique(x = g)
counter <- 0
for (i in seq_along(along.with = distinct_class_labels))
{
cluster_labels_of_obs_in_ith_class <- subset(x = p,
subset = (g == distinct_class_labels[i]))
unique_cluster_labels_of_obs_in_ith_class <- unique(x = cluster_labels_of_obs_in_ith_class)
if (length(x = unique_cluster_labels_of_obs_in_ith_class) == 1)
{
class_labels_of_obs_in_this_cluster <- subset(x = g,
subset = (p == unique_cluster_labels_of_obs_in_ith_class))
if (length(x = unique(x = class_labels_of_obs_in_this_cluster)) == 1)
{
counter <- (counter + 1)
}
}
}
counter
#> [1] 2
Created on 2019-05-22 by the reprex package (v0.3.0)
This works correctly, but it is slow (and I do not like this approach). I suppose one could use dplyr::group_by on both g and p separately and somehow compare the resulting groups. I expect there are better approaches, and I would highly appreciate such answers.
Thanks.
If you are also interested in which class/cluster combinations were correctly detected, you can try this:
library(tidyverse)
tibble(g = g, p=p) %>%
distinct(g,p) %>% # unique combinations of g and p
add_count(g, name="g_count") %>% # count how often each g label occurs among the unique pairs; a perfectly matched group appears exactly once
add_count(p, name="p_count") %>%
filter(g_count == 1 & p_count == 1) %>%
select(g,p)
# A tibble: 2 x 2
g p
<dbl> <dbl>
1 1 3
2 4 2
The number of rows (you can use nrow()) gives you the number of correctly detected groups.
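For example, appending nrow() to the same pipeline (a small usage sketch):
tibble(g = g, p = p) %>%
  distinct(g, p) %>%
  add_count(g, name = "g_count") %>%
  add_count(p, name = "p_count") %>%
  filter(g_count == 1 & p_count == 1) %>%
  nrow()
#> [1] 2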
Convert g and p to factors with levels ordered by their first occurrence in each vector, and count the frequencies that match.
sum(table(factor(p, levels = unique(p))) == table(factor(g, levels = unique(g))))
#[1] 2
To understand, see
table(factor(p, levels = unique(p)))
#3 1 4 2
#3 4 2 1
table(factor(g, levels = unique(g)))
#1 2 3 4
#3 3 3 1
We can ignore the labels (as the group labels are not the same) and focus only on the frequencies. We can see that the first and fourth values have the same frequency; hence the count of 2.
If you want to find out which groups are similar, you can do
inds <- table(factor(p, levels = unique(p))) == table(factor(g, levels = unique(g)))
unique(p)[inds]
#[1] 3 2
unique(g)[inds]
#[1] 1 4
This says that group 3 in p is similar to group 1 in g, and likewise group 2 in p to group 4 in g.
Before solving it with table, I did it with split, although the underlying logic is the same:
sum(lengths(split(p, factor(p, levels = unique(p)))) ==
lengths(split(g, factor(g, levels = unique(g)))))
EDIT
If g and p do not contain the same number of distinct labels (class imbalance), we need to combine the levels so that both tables cover all of them. For example,
g1 <- c(g, 5)
p1 <- c(p, 1)
sum(table(factor(p1, levels = unique(c(p1, g1)))) ==
table(factor(g1, levels = unique(c(g1, p1)))))
#[1] 2
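If you also want something that is robust when two different groups merely happen to have the same size, here is a sketch that compares the actual member indices rather than the frequencies:
groups_g <- split(seq_along(g), g) # observation indices in each true class
groups_p <- split(seq_along(p), p) # observation indices in each cluster
sum(sapply(groups_g, function(gi) any(sapply(groups_p, identical, gi))))
#> [1] 2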
I have a data frame with a set of x and y points. I need to calculate the Euclidean distance from every point to every other point, and then figure out, for each row, how many points are within a given range.
For example, if I had this data frame:
x y
- -
1 2
2 2
9 9
I should add a column that indicates how many other points (treating these as points in a Cartesian plane) are within a distance of 3 units of each point.
x y n
- - -
1 2 1
2 2 1
9 9 0
Thus, the first point (1,2) has one other point (2,2) that is within that range, whereas the point (9,9) has 0 points at a distance of 3 units.
I could do this with a couple of nested for loops, but I am interested in solving it in R in an idiomatic way, preferably using dplyr or another library.
This is what I have:
ddply(.data=mydataframe, .variables('x', 'y'), .fun=count.in.range)
count.in.range <- function (df) {
xp <- df$x
yp <- df$y
return(nrow(filter(df, dist( rbind(c(x,y), c(xp,yp)) ) < 3 )))
}
But, for some reason, this doesn't work. I think it has to do with filter.
Given
df_ <- data.frame(x = c(1, 2, 9),
y = c(2, 2, 9))
You can use the function "dist":
matrix_dist <- as.matrix(dist(df_))
df_$n <- rowSums(matrix_dist <= 3) - 1 # subtract 1 so the point itself is not counted
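With the toy data this reproduces the desired column:
df_
#  x y n
#1 1 2 1
#2 2 2 1
#3 9 9 0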
This is a base R approach that applies a "distance function" directly, but only on a row-by-row basis:
apply( df_ , 1, function(x) sum( (x[1] - df_[['x']])^2+(x[2]-df_[['y']])^2 <=9 )-1 )
#[1] 1 1 0
It's also really a "sweep" operation, although I wouldn't really expect a performance improvement.
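For illustration, the same pairwise computation can be vectorised with outer(), which builds the full matrix of squared distances in one go (a sketch; memory grows with the square of the number of rows):
d2 <- outer(df_$x, df_$x, "-")^2 + outer(df_$y, df_$y, "-")^2 # squared distances
rowSums(d2 <= 9) - 1 # within 3 units, excluding the point itself
#[1] 1 1 0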
I would suggest you work with pairs of points in long format and then use a data.table solution, which is probably one of the fastest alternatives for working with large datasets.
library(data.table)
library(reshape)
df <- data.frame(x = c(1, 2, 9),
y = c(2, 2, 9))
The first thing you need to do is to reshape your data to long format with all possible combinations of pairs of points:
df_long <- expand.grid.df(df, df)
# convert to data.table and rename the columns
setDT(df_long)
setnames(df_long, c("x", "y", "x1", "y1"))
Now you only need to do this:
# calculate distance between pairs
df_long[ , mydist := dist ( matrix(c(x,x1,y,y1), ncol = 2, nrow = 2) ) , by=.(x,y,x1,y1)]
# count how many points are within a distance of 3 units
# (note: each point's zero-distance pair with itself is included; subtract 1 to exclude it)
df_long[mydist < 3, .(count = .N), by = .(x, y)]
#> x y count
#> 1: 1 2 2
#> 2: 2 2 2
#> 3: 9 9 1
I'm trying to make several wordclouds to compare terms, which themselves are nested within groups. I would like to make one wordcloud per group. The wordcloud package in R can make the wordclouds I need, but each new wordcloud has the size of its words scaled relative to the maximum and minimum word frequency within that cloud. This can be adjusted with the scale parameter.
My aim is to make wordclouds where the size of the word is absolutely related to the frequency of the word, enabling different wordclouds to be visually compared.
library(wordcloud)
dat <- data.frame(word = rep(LETTERS[1:3], 2), freq = c(10, 5, 3, 20, 10, 6), group = c(1, 1, 1, 2, 2, 2))
dat
# word freq group
#1 A 10 1
#2 B 5 1
#3 C 3 1
#4 A 20 2
#5 B 10 2
#6 C 6 2
wordcloud(dat$word[dat$group == 1], dat$freq[dat$group == 1])
wordcloud(dat$word[dat$group == 2], dat$freq[dat$group == 2]) # Currently the same
This is the current wordcloud I get from the above command, run on both groups in the MWE (although the exact placement will vary randomly with each run). I would like each letter in the second group's wordcloud to be twice as large as the first, in line with the data (or for there to be some sensible scaled difference, even if it is not linear).
How can this be achieved?
Hm, this might be a roundabout way, but what if we set the scale for all groups using a single anchor? The largest frequency across all groups then always maps to the full scale value (8 here), and each group's maximum is scaled down proportionally, so word sizes become comparable across clouds.
anchor <- max(dat$freq)
wordcloud(dat$word[dat$group == 1], dat$freq[dat$group == 1], scale = c(8*max(dat$freq[dat$group == 1])/anchor, 0.5))
wordcloud(dat$word[dat$group == 2], dat$freq[dat$group == 2], scale = c(8*max(dat$freq[dat$group == 2])/anchor, 0.5))
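One way to package this so that every group is drawn on the same absolute scale is a small helper; a sketch (make_cloud is a hypothetical name, not part of the wordcloud package):
make_cloud <- function(grp, dat, anchor = max(dat$freq), max_cex = 8) {
  sub <- dat[dat$group == grp, ]
  wordcloud(sub$word, sub$freq, scale = c(max_cex * max(sub$freq) / anchor, 0.5))
}
make_cloud(1, dat) # words drawn at half the size of group 2's words
make_cloud(2, dat)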