igraph adjacent edges based on attributes - r

For each vertex, I am interested in the number of its adjacent edges based on a condition. In the following example, the condition is having a different gender.
Example:
library(igraph)
library(ggraph)
library(tidyverse)
nodes <- tibble(id = 1:4,
gender = c("M", "F", "F", "M"),
names = c("Bob", "Allie", "Mary", "Johnathon"))
edges <- tibble(from = c(1, 3, 2, 4, 1, 2, 1, 4),
to = c(2, 2, 4, 1, 3, 1, 4, 3))
network <- graph_from_data_frame(d = edges, vertices = nodes, directed = TRUE)
ggraph(network) +
geom_edge_link(arrow = arrow(length = unit(4,
'mm')),
start_cap = circle(4, 'mm'),
end_cap = circle(4, 'mm')) +
geom_node_text(aes(label = names)) +
theme_graph()
Desired result:
id name adjacent_edges
1 Bob 1
2 Allie 1
3 Mary 2
4 Johnathon 1

Here's an approach combining base R with igraph:
nodes %>%
mutate(adjacent_edges = colSums(as.matrix(outer(gender, gender, `!=`) * as_adj(network)) != 0))
# A tibble: 4 x 4
# id gender names adjacent_edges
# <int> <chr> <chr> <dbl>
# 1 1 M Bob 1
# 2 2 F Allie 1
# 3 3 F Mary 2
# 4 4 M Johnathon 1
Here outer(gender, gender, `!=`) builds a logical matrix whose entries are TRUE when the two genders differ, while as_adj(network) is the usual graph adjacency matrix. Their elementwise product is therefore nonzero exactly for connected pairs of nodes with different genders. Because the graph is directed, colSums counts such edges pointing into each vertex, which gives the desired result.
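As a minimal sketch of the same idea without igraph, the adjacency matrix can be built by hand in base R from the example's edge list. The names adj, differs, mixed, incoming and outgoing are illustrative and do not come from the answer.

```r
# Node attributes and directed edges copied from the example in the question
gender <- c("M", "F", "F", "M")
from   <- c(1, 3, 2, 4, 1, 2, 1, 4)
to     <- c(2, 2, 4, 1, 3, 1, 4, 3)

# adj[i, j] is 1 when there is an edge i -> j
adj <- matrix(0, nrow = 4, ncol = 4)
adj[cbind(from, to)] <- 1

# differs[i, j] is TRUE when nodes i and j have different genders
differs <- outer(gender, gender, `!=`)

# Keep only edges that join different genders, then count them per vertex
mixed    <- differs * adj != 0
incoming <- colSums(mixed)  # edges pointing into each vertex
outgoing <- rowSums(mixed)  # edges leaving each vertex
```

Here incoming reproduces the desired adjacent_edges column (1, 1, 2, 1), while outgoing is the analogous count for edges leaving each vertex.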
Here's another one, lengthier but also more transparent:
edges %>% full_join(nodes, by = c("from" = "id")) %>%
full_join(nodes, by = c("to" = "id"), suffix = c(".from", ".to")) %>%
group_by(to, names.to) %>% summarise(adjacent_edges = sum(gender.to != gender.from)) %>%
rename(id = to, name = names.to)
# A tibble: 4 x 3
# Groups: id [4]
# id name adjacent_edges
# <dbl> <chr> <int>
# 1 1 Bob 1
# 2 2 Allie 1
# 3 3 Mary 2
# 4 4 Johnathon 1
In this case we start with the list of edges and join the list of nodes to it twice: once to attach the node information for the from endpoint and once for the to endpoint, so that both sit in the same row. It then remains to summarise the data by counting, for each to node, the neighbours with a different gender.
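The same lookup-then-count logic can be sketched in base R with match and tapply instead of joins. This is an illustration under the assumption that every vertex has at least one incoming edge; vertices without one would simply be absent from the result.

```r
# Example data from the question, as plain data frames
nodes <- data.frame(id = 1:4,
                    gender = c("M", "F", "F", "M"),
                    names = c("Bob", "Allie", "Mary", "Johnathon"),
                    stringsAsFactors = FALSE)
edges <- data.frame(from = c(1, 3, 2, 4, 1, 2, 1, 4),
                    to   = c(2, 2, 4, 1, 3, 1, 4, 3))

# Look up the gender at both ends of every edge
gender_from <- nodes$gender[match(edges$from, nodes$id)]
gender_to   <- nodes$gender[match(edges$to, nodes$id)]

# For each target vertex, count incoming edges whose ends differ in gender
adjacent_edges <- tapply(gender_from != gender_to, edges$to, sum)
```

The result is an array named by vertex id, with one count per target vertex.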

Related

Nested if statement and logic with strings in csv file in R

I have a csv file that looks like this:
I am trying to create an algorithm that works like this:
Iterate through each row;
If the condition is Success,
if T1 == P1, increase score one point
if T2 == P2, increase score one point
if T3 == P3, increase score one point
Else if the condition is Failure,
elif T1 != P1, increase score one point
elif T2 != P2, increase score one point
elif T3 != P3, increase score one point
However, I got stuck on 2 things:
When I say something like:
for (i in 1:4){
if (data[i,7] == "Success")
.......
There is a syntax problem because of using string with logic. How to get it right?
It doesn't calculate correctly when I state something like: if(data[i,1] == data[i,4]) {score = score+1}, but it does calculate correctly if I use numbers instead of letters in the csv file. Again, how to use strings with logic operators?
The other problem is using nested if statements. How to do it so I can use the algorithm above?
Thank you for your time!
We may also do this with across: loop over the columns selected by starts_with('T'); inside the loop, take the current column name (cur_column()), replace 'T' with 'P' in it, and fetch the matching P column with get. The logical comparison of the two columns is converted to a position index by adding 1 (R indexing starts at 1), which picks -1 for a mismatch and 1 for a match from the vector c(-1, 1). Finally, rowSums over the across output creates the 'total_score' column.
library(dplyr)
library(stringr)
df %>%
mutate(total_score = rowSums(across(starts_with('T'),
~ c(-1, 1)[1 + (. == get(str_replace(cur_column(), 'T', 'P')))])))
-output
# A tibble: 4 x 5
T1 T2 P1 P2 total_score
<chr> <chr> <chr> <chr> <dbl>
1 a b a a 0
2 a a a a 2
3 a a a b 0
4 b a b b 0
data
df <- structure(list(T1 = c("a", "a", "a", "b"), T2 = c("b", "a", "a",
"a"), P1 = c("a", "a", "a", "b"), P2 = c("a", "a", "b", "b")), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
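For comparison, here is a base R sketch of the same scoring that compares the T and P columns as matrices. It assumes, as in the data above, that every T column has a P column with the same suffix; t_cols, p_cols and matches are illustrative names.

```r
# Same example data, as a plain data frame with character columns
df <- data.frame(T1 = c("a", "a", "a", "b"),
                 T2 = c("b", "a", "a", "a"),
                 P1 = c("a", "a", "a", "b"),
                 P2 = c("a", "a", "b", "b"),
                 stringsAsFactors = FALSE)

# Pair every T column with the P column carrying the same suffix
t_cols <- grep("^T", names(df), value = TRUE)
p_cols <- sub("^T", "P", t_cols)

# Elementwise comparison of the two blocks of columns
matches <- as.matrix(df[t_cols]) == as.matrix(df[p_cols])

# +1 for a match, -1 for a mismatch, summed per row
df$total_score <- rowSums(ifelse(matches, 1, -1))
```

For this data total_score is 0, 2, 0, 0, the same as the across version.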
A case_when structure can be used for this. Since we don't know what your data looks like, I created dummy data to stand in for yours:
library(dplyr)
set.seed(1453)
scores <- data.frame(T1=sample(1:5,size = 200,replace = T),
T2=sample(1:5,size = 200,replace = T),
T3=sample(1:5,size = 200,replace = T),
P1=sample(1:5,size = 200,replace = T),
P2=sample(1:5,size = 200,replace = T),
P3=sample(1:5,size = 200,replace = T),
score=sample(50:100,size = 200,replace = T))
scores2 <- scores %>%
mutate(new_score=case_when(T1==P1 ~ score + 1,
T2==P2 ~ score + 1,
T3==P3 ~ score + 1,
TRUE ~ score - 1))
scores2%>%
head
Note: TRUE means otherwise. case_when uses only the first condition that matches, so each row gains at most one point.
output;
T1 T2 T3 P1 P2 P3 score new_score
<int> <int> <int> <int> <int> <int> <int> <dbl>
1 4 2 2 1 5 5 64 63
2 3 5 4 2 1 3 82 81
3 5 1 5 4 5 5 89 90
4 2 5 3 4 5 1 62 63
5 3 5 4 3 2 4 53 54
6 3 1 4 1 3 2 82 81
If I understood the problem correctly, each row is an observation, so I compare each T column with its corresponding P column, then create a score for each comparison, and finally sum the scores for each row.
Libraries
library(tidyverse)
Example Data
df <-
tibble(
T1 = c("a","a","a","b"),
T2 = c("b","a","a","a"),
P1 = c("a","a","a","b"),
P2 = c("a","a","b","b")
)
Code
df %>%
mutate(
S1 = if_else(T1 == P1, 1,-1),
S2 = if_else(T2 == P2, 1,-1)
) %>%
rowwise() %>%
mutate(total_score = sum(c_across(starts_with("S"))))
Output
# A tibble: 4 x 7
# Rowwise:
T1 T2 P1 P2 S1 S2 total_score
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 a b a a 1 -1 0
2 a a a a 1 1 2
3 a a a b 1 -1 0
4 b a b b 1 -1 0
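None of the snippets above uses the Success/Failure condition from the question, so here is a hedged base R sketch that does. The data is made up (the original csv is not shown), the column name condition is an assumption, and it assumes that every match counts on a Success row and every mismatch counts on a Failure row. Reading the columns as character rather than factor (stringsAsFactors = FALSE) is one way to avoid trouble when comparing strings with ==.

```r
# Hypothetical data standing in for the csv from the question
data <- data.frame(
  T1 = c("a", "b", "a"), T2 = c("b", "b", "c"), T3 = c("c", "a", "a"),
  P1 = c("a", "a", "a"), P2 = c("b", "b", "b"), P3 = c("a", "a", "c"),
  condition = c("Success", "Failure", "Failure"),
  stringsAsFactors = FALSE
)

t_cols <- c("T1", "T2", "T3")
p_cols <- c("P1", "P2", "P3")

# Number of T/P pairs that agree in each row
n_match <- rowSums(as.matrix(data[t_cols]) == as.matrix(data[p_cols]))

# Success rows score their matches, Failure rows score their mismatches
points <- ifelse(data$condition == "Success",
                 n_match,
                 length(t_cols) - n_match)
score <- sum(points)
```

Because the comparison is vectorised, neither an explicit row loop nor nested if statements are needed.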

Dynamically create value labels with haven::labelled, follow-up

Follow-up question to Dynamically create value labels with haven::labelled, where akrun provided a good answer using deframe.
I am using haven::labelled to set value labels of a variable. The goal is to create a fully documented dataset I can export to SPSS.
Now, say I have a df value_labels of values and their value labels. I also have a df df_data with variables to which I want to allocate value labels.
value_labels <- tibble(
value = c(seq(1:6), seq(1:3), NA),
labels = c(paste0("value", 1:6),paste0("value", 1:3), NA),
name = c(rep("var1", 6), rep("var2", 3), "var3")
)
df_data <- tibble(
id = 1:10,
var1 = floor(runif(10, 1, 7)),
var2 = floor(runif(10, 1, 4)),
var3 = rep("string", 10)
)
Manually, I would create value labels for df_data$var1 and df_data$var2 like so:
df_data$var1 <- haven::labelled(df_data$var1, labels = c(values1 = 1, values2 = 2, values3 = 3, values4 = 4, values5 = 5, values6 = 6))
df_data$var2 <- haven::labelled(df_data$var2, labels = c(values1 = 1, values2 = 2, values3 = 3))
I need a more dynamic way of assigning the correct value labels to the correct variable in a large dataset. The solution also needs to ignore character vectors, since I don't want these to have value labels. For that reason, var3 in value_labels is listed as NA.
The solution does not need to work with multiple datasets in a list.
Here is one option: after removing the NA rows, we build a vector of values named by their labels and split it by 'name'. We then use the names of the resulting list to subset the columns of 'df_data', apply labelled to each column with its own labels, and assign the result back to the same columns.
lbls2 <- na.omit(value_labels)
lstLbls <- with(lbls2, split(setNames(value, labels), name))
df_data[names(lstLbls)] <- Map(haven::labelled,
df_data[names(lstLbls)], labels = lstLbls)
df_data
# A tibble: 10 x 4
# id var1 var2 var3
# <int> <dbl+lbl> <dbl+lbl> <chr>
# 1 1 2 [value2] 2 [value2] string
# 2 2 5 [value5] 2 [value2] string
# 3 3 4 [value4] 1 [value1] string
# 4 4 1 [value1] 2 [value2] string
# 5 5 1 [value1] 1 [value1] string
# 6 6 6 [value6] 2 [value2] string
# 7 7 1 [value1] 3 [value3] string
# 8 8 1 [value1] 1 [value1] string
# 9 9 3 [value3] 3 [value3] string
#10 10 6 [value6] 1 [value1] string
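To see what the split step produces before haven is involved, here is a small base R sketch that rebuilds the lookup table as a plain data frame and inspects the resulting list. It only illustrates the intermediate structure and is not part of the answer.

```r
# The lookup table from the question, as a plain data frame
value_labels <- data.frame(
  value  = c(1:6, 1:3, NA),
  labels = c(paste0("value", 1:6), paste0("value", 1:3), NA),
  name   = c(rep("var1", 6), rep("var2", 3), "var3"),
  stringsAsFactors = FALSE
)

# Drop the var3 row, whose value and label are NA
lbls2 <- na.omit(value_labels)

# One named vector of values per variable, with the labels as names
lstLbls <- with(lbls2, split(setNames(value, labels), name))

str(lstLbls)
```

Each list element has the shape that the labels argument of haven::labelled expects, and var3 no longer appears, so character columns are left untouched.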

R : how to control behaviour of edges in ggraph

I'm facing this issue: I got some data like these:
library(tidyverse)
library(tidygraph)
library(ggraph)
library(ggrepel)
edges <- data.frame(a=c('k','k','k','k','k','z','z'),
b=c('b','b','b','b','c','b','c'), costant = 1)
a b costant
1 k b 1
2 k b 1
3 k b 1
4 k b 1
5 k c 1
6 z b 1
7 z c 1
Now I would like to have a graph with ggraph that has nodes and edges with weights. So I worked this way:
# first I calculated the edges weights
edges1 <- edges%>% group_by(a,b) %>% summarise(weight = sum(costant))
> edges1
# A tibble: 4 x 3
# Groups: a [?]
a b weight
<fct> <fct> <dbl>
1 k b 4
2 k c 1
3 z b 1
4 z c 1
Then the nodes:
nodes <- rbind(data.frame(word = edges$a, n = 1),data.frame(word = edges$b, n = 1)) %>%
group_by(word) %>%
summarise(n = sum(n))
> nodes
# A tibble: 4 x 2
word n
<fct> <dbl>
1 k 5
2 z 2
3 b 5
4 c 2
Till now, everything works fine. Now, following this as example:
tidy <- tbl_graph(nodes = nodes, edges = edges1, directed = T)
tidy <- tidy %>%
activate(edges) %>%
arrange(desc(weight)
)
Then I plotted the graph:
ggraph(tidy, layout = "gem") +
geom_node_point(aes(size=n)) +
geom_edge_link(aes(width = weight), alpha = 0.8) +
scale_edge_width(range = c(0.2, 2)) +
geom_text_repel(aes(x = x, y=y , label=word))
But the result is this:
And I cannot figure out why there is a line between k and z, because that edge does not exist.
Thanks in advance.
It seems this is because tbl_graph converts the node columns of the edges1 tibble from factor to integer with as.integer, without consulting the nodes tibble; this is the source of the error. If we convert the edge endpoints to the correct integer indices beforehand, it works as expected.
edges <- data.frame(a=c('k','k','k','k','k','z','z'),
b=c('b','b','b','b','c','b','c'), costant = 1)
edges1 <- edges%>% group_by(a,b) %>% summarise(weight = sum(costant))
nodes <- rbind(data.frame(word = edges$a, n = 1),data.frame(word = edges$b, n = 1)) %>%
group_by(word) %>%
summarise(n = sum(n))
edges2 <- edges1 # save the edges with factor node labels in edges2
# convert 'from' and 'to' factor columns to integer columns correctly
# with the nodes tibble's corresponding matched index values
edges1$a <- match(edges1$a, nodes$word)
edges1$b <- match(edges1$b, nodes$word)
tidy <- tbl_graph(nodes = nodes, edges = edges1, directed = T)
tidy <- tidy %>%
activate(edges) %>%
arrange(desc(weight)
)
ggraph(tidy, layout = "gem") +
geom_node_point(aes(size=n)) +
geom_edge_link(aes(width = weight), arrow = arrow(length = unit(4, 'mm')), end_cap = circle(3, 'mm'), alpha = 0.8) +
scale_edge_width(range = c(0.2, 2)) +
geom_text_repel(aes(x = x, y=y , label=word))
edges2 # compare the edges in the following tibble with the next figure
# A tibble: 4 x 3
# Groups: a [?]
a b weight
<fct> <fct> <dbl>
#1 k b 4
#2 k c 1
#3 z b 1
#4 z c 1
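The mismatch this answer describes can be reproduced in base R. The sketch below assumes, as the answer states, that the factor columns are reduced to their integer codes; it uses the four aggregated edges and the node order k, z, b, c from the question, and the variable names are illustrative.

```r
# Endpoints of the four aggregated edges, stored as factors
edge_from <- factor(c("k", "k", "z", "z"))
edge_to   <- factor(c("b", "c", "b", "c"))

# Node order as it appears in the nodes tibble
node_words <- c("k", "z", "b", "c")

# Integer codes refer to each factor's own levels, not to rows of nodes
codes_to   <- as.integer(edge_to)   # levels are "b", "c"
wrong_node <- node_words[codes_to]  # the codes land on "k" and "z"

# match() looks each label up in the node table instead
idx_to     <- match(edge_to, node_words)
right_node <- node_words[idx_to]    # the labels are recovered
```

With the raw codes, the edge k to c is read as node 1 to node 2, that is k to z, which would explain the unexpected line in the plot.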

How to create a rank for a variable in a longitudinal dataset based on a condition?

I have a longitudinal dataset where each subject is represented more than once. One row represents one admission for a patient. Each admission, regardless of the subject, also has a unique "key". I need to figure out which admission is the "INDEX" admission, that is, the first admission, so that I know which rows are the subsequent RE-admissions. The variable to use is "Daystoevent"; the lowest number represents the INDEX admission. I want to create a new variable based on the condition that, for each subject, the lowest number in the variable "Daystoevent" is the "index" admission and each subsequent admission gets a number "1", "2", etc. I want to do this WITHOUT changing into the horizontal format.
The dataset looks like this:
Subject Daystoevent Key
A 5 rtwe
A 8 erer
B 3 tter
B 8 qgfb
A 2 sada
C 4 ccfw
D 7 mjhr
B 4 sdfw
C 1 srtg
C 2 xcvs
D 3 muyg
Would appreciate some help.
This may not be an elegant solution but will do the job:
library(dplyr)
df <- df %>%
group_by(Subject) %>%
arrange(Subject, Daystoevent) %>%
mutate(
Admission = if_else(Daystoevent == min(Daystoevent), 0, 1),
) %>%
ungroup()
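# Number the readmissions in order; an index row (0) is never overwritten,
# so the count restarts at each new subject
for(i in 1:(nrow(df) - 1)) {
if(df$Admission[i + 1] != 0) {
df$Admission[i + 1] <- df$Admission[i] + 1
}
}
# Relabel only the Admission column, not zeros elsewhere in the data
df$Admission[df$Admission == 0] <- "index"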
df
# # A tibble: 11 x 4
# Subject Daystoevent Key Admission
# <chr> <dbl> <chr> <chr>
# 1 A 2 sada index
# 2 A 5 rtwe 1
# 3 A 8 erer 2
# 4 B 3 tter index
# 5 B 4 sdfw 1
# 6 B 8 qgfb 2
# 7 C 1 srtg index
# 8 C 2 xcvs 1
# 9 C 4 ccfw 2
# 10 D 3 muyg index
# 11 D 7 mjhr 1
Data:
df <- tibble(
Subject = c("A", "A", "B", "B", "A", "C", "D", "B", "C", "C", "D"),
Daystoevent = c(5, 8, 3, 8, 2, 4, 7, 4, 1, 2, 3),
Key = c("rtwe", "erer", "tter", "qgfb", "sada", "ccfw", "mjhr", "sdfw", "srtg", "xcvs", "muyg")
)
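As an alternative sketch, the numbering can be computed per subject in base R with ave and rank, which needs neither a loop nor sorting and keeps the long format. It is a different technique from the answer above, and rank_in_subject is an illustrative name.

```r
# Data from the question, in its original row order
df <- data.frame(
  Subject = c("A", "A", "B", "B", "A", "C", "D", "B", "C", "C", "D"),
  Daystoevent = c(5, 8, 3, 8, 2, 4, 7, 4, 1, 2, 3),
  Key = c("rtwe", "erer", "tter", "qgfb", "sada", "ccfw",
          "mjhr", "sdfw", "srtg", "xcvs", "muyg"),
  stringsAsFactors = FALSE
)

# Rank Daystoevent within each subject; the earliest admission gets 0
rank_in_subject <- ave(df$Daystoevent, df$Subject,
                       FUN = function(x) rank(x, ties.method = "first") - 1)

# Label the earliest admission "index", keep the counter for the rest
df$Admission <- ifelse(rank_in_subject == 0,
                       "index",
                       as.character(rank_in_subject))
```

Ties in Daystoevent within a subject are broken by row order here; that choice is an assumption, since the question does not say how ties should be handled.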

R ifelse loop on unique values always resolves FALSE

I am newish to R and having trouble with a for loop over unique values.
with the df:
id = c(1,2,2,3,3,4)
rank = c(1,2,1,3,3,4)
df = data.frame(id, rank)
I run:
df$dg <- logical(6)
for(i in unique(df$id)){
ifelse(!unique(df$rank), df$dg ==T, df$dg == F)
}
I am trying to mark the $dg variable as T providing that rank is different for each unique id and F if rank is the same within each id.
I am not getting any errors, but I am only getting F for all values of $dg even though I should be getting a mix.
I have also used the following loop with the same results:
for(i in unique(df$id)){
ifelse(length(unique(df$rank)), df$dg ==T, df$dg == F)
}
I have read other similar posts but the advice has not worked for my case.
From Comments:
I want to mark dg TRUE for all instances of an id if rank changed at all for a given id. I'm looking to say, for a given ID which has anywhere between 1-13 instances, mark dg TRUE if rank differs across instances.
Update: How to identify groups (ids) that only have one rank?
After the clarification the OP provided, this would be a solution for this particular case:
library(dplyr)
df %>%
group_by(id) %>%
mutate(dg = ifelse( length(unique(rank))>1 | n() == 1, T, F))
For another data set, which also contains an id whose rank has both duplicate and non-duplicate values (presented below), this would be the output:
df2 %>%
group_by(id) %>%
mutate(dg = ifelse( length(unique(rank))>1 | n() == 1, T, F))
#:OUTPUT:
# Source: local data frame [9 x 3]
# Groups: id [5]
#
# # A tibble: 9 x 3
# id rank dg
# <dbl> <dbl> <lgl>
# 1 1 1 TRUE
# 2 2 2 TRUE
# 3 2 1 TRUE
# 4 3 3 FALSE
# 5 3 3 FALSE
# 6 4 4 TRUE
# 7 5 1 TRUE
# 8 5 1 TRUE
# 9 5 3 TRUE
Data-no-2:
df2 <- structure(list(id = c(1, 2, 2, 3, 3, 4, 5, 5, 5), rank = c(1, 2, 1, 3, 3, 4, 1, 1, 3
)), .Names = c("id", "rank"), row.names = c(NA, -9L), class = "data.frame")
How to identify duplicated rows within each group (id)?
You can use the dplyr package:
library(dplyr)
df %>%
group_by(id, rank) %>%
mutate(dg = ifelse(n() > 1, F,T))
This will give you:
# Source: local data frame [6 x 3]
# Groups: id, rank [5]
#
# # A tibble: 6 x 3
# id rank dg
# <dbl> <dbl> <lgl>
# 1 1 1 TRUE
# 2 2 2 TRUE
# 3 2 1 TRUE
# 4 3 3 FALSE
# 5 3 3 FALSE
# 6 4 4 TRUE
Note: You can simply convert it back to a data.frame().
A data.table solution would be:
library(data.table)
dt <- data.table(df)
dt$dg <- ifelse(dt[ , dg := .N, by = list(id, rank)]$dg>1,F,T)
Data:
df <- structure(list(id = c(1, 2, 2, 3, 3, 4), rank = c(1, 2, 1, 3,
3, 4)), .Names = c("id", "rank"), row.names = c(NA, -6L), class = "data.frame")
# > df
# id rank
# 1 1 1
# 2 2 2
# 3 2 1
# 4 3 3
# 5 3 3
# 6 4 4
N.B. Unless you want an identifier other than TRUE/FALSE, using ifelse() is redundant and computationally costly. (Credit: David Arenburg)
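To illustrate that remark, here is a base R sketch that produces the flag directly from a logical condition, without ifelse(). It follows the rule of the first answer (an id with a single instance also counts as TRUE); mark_id is a hypothetical helper name.

```r
# Data from the question
df <- data.frame(id   = c(1, 2, 2, 3, 3, 4),
                 rank = c(1, 2, 1, 3, 3, 4))

# TRUE when the ranks of one id differ, or when the id occurs only once
mark_id <- function(r) length(unique(r)) > 1 || length(r) == 1

# ave() applies the rule per id and repeats the result on every row of that id;
# it returns 0/1 here, so convert back to logical
df$dg <- as.logical(ave(df$rank, df$id, FUN = mark_id))
```

The comparison already yields TRUE or FALSE, so wrapping it in ifelse(..., T, F) adds nothing.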

Resources