Creating an igraph object by matching ID numbers in R

I am seeking a solution to an issue with creating an igraph object for my master's thesis. I need to conduct a network analysis on a database that matches participants by project ID - there are about 1500 rows in total, covering roughly 600 different participants and roughly 300 projects. Many participants are involved in multiple projects. A sample of the data is below:
   Participant_ID Project_Number
1               1         101314
2               2         101314
3               3         101314
4               1         101314
5               5         101346
6               6         101346
7               7         101346
8               8         101531
9               9         101531
10              3         101531
11             11         101533
12              8         101533
13              3         101533
14             14         101533
15              9         101612
How can I match these in a way that creates a network? How could I turn these into an edge list - whether via a matrix or otherwise - given that there are so many observations and a differing number of participants on each project? I'm finding it hard to find anything online that shows how to create an edge list from ID numbers, and I imagine what I'm missing is a fairly ordinary few lines of code.

Like this? Project_Number must be converted into a factor to prevent the creation of a graph with 100k+ nodes:
library(tidyverse)
library(tidygraph)
#>
#> Attaching package: 'tidygraph'
#> The following object is masked from 'package:stats':
#>
#>     filter

graph <-
  tribble(
    ~Participant_ID, ~Project_Number,
    1, 101314,
    2, 101314,
    3, 101314,
    1, 101314
  ) %>%
  mutate(Project_Number = Project_Number %>% factor()) %>%
  tbl_graph(edges = .)

graph
#> # A tbl_graph: 4 nodes and 4 edges
#> #
#> # A directed acyclic multigraph with 1 component
#> #
#> # Node Data: 4 × 1 (active)
#>   name
#>   <chr>
#> 1 1
#> 2 101314
#> 3 2
#> 4 3
#> #
#> # Edge Data: 4 × 2
#>    from    to
#>   <int> <int>
#> 1     1     2
#> 2     3     2
#> 3     4     2
#> # … with 1 more row
graph %>% as.igraph() %>% plot()
Created on 2021-09-10 by the reprex package (v2.0.1)
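If what you ultimately need is a one-mode network of participants (two participants linked whenever they share a project), a self-join gives you that edge list directly. A minimal sketch, assuming your data frame is called df and has the two columns shown in the question:

library(dplyr)
library(igraph)

# pair up participants within each project; the filter keeps each
# unordered pair once and drops self-pairs
edges <- df %>%
  inner_join(df, by = "Project_Number") %>%
  filter(Participant_ID.x < Participant_ID.y) %>%
  distinct(Participant_ID.x, Participant_ID.y)

g <- graph_from_data_frame(edges, directed = FALSE)
plot(g)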

Maybe a bipartite graph could help
library(igraph)

# df holds the Participant_ID / Project_Number pairs from the question
graph_from_data_frame(df) %>%
  set_vertex_attr(name = "type", value = names(V(.)) %in% df$Participant_ID) %>%
  plot(layout = layout_as_bipartite)
If you want a matrix output, you can try
> table(df)
              Project_Number
Participant_ID 101314 101346 101531 101533 101612
            1       2      0      0      0      0
            2       1      0      0      0      0
            3       1      0      1      1      0
            5       0      1      0      0      0
            6       0      1      0      0      0
            7       0      1      0      0      0
            8       0      0      1      1      0
            9       0      0      1      0      1
            11      0      0      0      1      0
            14      0      0      0      1      0
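That table is effectively a participant × project incidence matrix, so a further option is to project it into a participant adjacency matrix by matrix multiplication and hand that to igraph. A sketch, again assuming the df from above:

library(igraph)

# plain incidence matrix: participants in rows, projects in columns
m <- unclass(table(df))

# adjacency: entry [i, j] counts the projects shared by participants i and j
# (note: duplicated participant/project rows inflate these counts, so
# de-duplicate df first if that matters)
adj <- m %*% t(m)
diag(adj) <- 0

g <- graph_from_adjacency_matrix(adj, mode = "undirected", weighted = TRUE)
plot(g)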

New column which counts the number of times a value in a specific row of one column appears in another column

I have tried searching for an answer to this question but it continues to elude me! I am working with crime data where each row refers to a specific crime incident. There is a variable for suspect ID, and a variable for victim ID. These ID numbers are consistent across the two columns (in other words, if a row contains the ID 424 in the victim ID column, and a separate row contains the ID 424 in the suspect column, I know that the same person was listed as a victim in the first crime and as a suspect in the second crime).
I want to create two new variables: one which counts the number of times the victim (in a particular crime incident) has been recorded as a suspect (in the dataset as a whole), and one which counts the number of times the suspect (in a particular crime incident) has been recorded as a victim (in the dataset as a whole).
Here's a simplified version of my data:
  s.uid v.uid
1     1     9
2     2     8
3     3     2
4     4     2
5     5     2
6    NA     7
7     5     6
8     9     5
And here is what I want to create:
  s.uid v.uid s.in.v v.in.s
1     1     9      0      1
2     2     8      3      0
3     3     2      0      1
4     4     2      0      1
5     5     2      1      1
6    NA     7     NA      0
7     5     6      1      0
8     9     5      1      2
Note that, where there is an NA, I would like the NA to be preserved. I'm currently trying to work in tidyverse and piping where possible, so I would prefer answers in that kind of format, but I'm open to any solution!
Using dplyr:
dat %>%
  group_by(s.uid) %>%
  mutate(s.in.v = sum(dat$v.uid %in% s.uid)) %>%
  group_by(v.uid) %>%
  mutate(v.in.s = sum(dat$s.uid %in% v.uid))
# A tibble: 8 × 4
# Groups:   v.uid [6]
  s.uid v.uid s.in.v v.in.s
  <int> <int>  <int>  <int>
1     1     9      0      1
2     2     8      3      0
3     3     2      0      1
4     4     2      0      1
5     5     2      1      1
6    NA     7      0      0
7     5     6      1      0
8     9     5      1      2
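Note that row 6 comes out as 0 rather than the NA the question asked to preserve. A small variant that keeps the NA; a sketch using purrr, assuming dat is the tibble of s.uid/v.uid pairs from the question:

library(dplyr)
library(purrr)

dat %>%
  mutate(
    # guard each count with is.na() so missing IDs stay missing
    s.in.v = ifelse(is.na(s.uid), NA, map_int(s.uid, ~ sum(v.uid %in% .x))),
    v.in.s = ifelse(is.na(v.uid), NA, map_int(v.uid, ~ sum(s.uid %in% .x)))
  )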
First, a reprex of your data:
library(tidyverse)
# Replica of your data:
s.uid <- c(1:5, NA, 5, 9)
v.uid <- c(9, 8, 2, 2, 2, 7, 6, 5)
DF <- tibble(s.uid, v.uid)
Custom function to use:
# function to check how many times "a" (a length 1 atomic vector) occurs in "b":
f <- function(a, b) {
  a <- as.character(a)
  # make a lookup table (a.k.a. dictionary) of the values in b:
  b_freq <- table(b, useNA = "always")
  # if a is in b, return its frequency:
  if (a %in% names(b_freq)) {
    return(b_freq[a])
  }
  # else (i.e. a is not in b) return 0:
  return(0)
}

# vectorise that, enabling intake of "a" of any length:
ff <- function(a, b) {
  purrr::map_dbl(.x = a, .f = f, b = b)
}
Finally:
DF |>
  mutate(
    s_in_v = ff(s.uid, v.uid),
    v_in_s = ff(v.uid, s.uid)
  )
Results in:
#> # A tibble: 8 × 4
#>   s.uid v.uid s_in_v v_in_s
#>   <dbl> <dbl>  <dbl>  <dbl>
#> 1     1     9      0      1
#> 2     2     8      3      0
#> 3     3     2      0      1
#> 4     4     2      0      1
#> 5     5     2      1      1
#> 6    NA     7     NA      0
#> 7     5     6      1      0
#> 8     9     5      1      2

Simple operation with lagged values

I need to calculate row-wise simple operations using lagged values, for example the sum of a variable over the previous x years.
I tried:
toy %>%
  group_by(student) %>%
  mutate(lag_passed = sum(lag(passed, n = 5, order_by = year, default = 0)))

toy %>%
  group_by(student) %>%
  arrange(year) %>%
  mutate(lag_passed = lapply(passed, function(x) sum(lag(x, n = 5, default = 0))))
Reproducible example. The task: sum the number of tests passed in the previous five years.
toy <- data.frame(student = rep("A", 10),
                  year = c(1:10),
                  passed = c(0,0,0,1,2,0,0,1,0,1))

   student year passed
1        A    1      0
2        A    2      0
3        A    3      0
4        A    4      1
5        A    5      2
6        A    6      0
7        A    7      0
8        A    8      1
9        A    9      0
10       A   10      1
expected <- data.frame(student = rep("A", 10),
                       year = c(1:10),
                       passed = c(0,0,0,1,2,0,0,1,0,1),
                       lag_passed = c(0,0,0,0,1,3,3,3,4,3))

   student year passed lag_passed
1        A    1      0          0
2        A    2      0          0
3        A    3      0          0
4        A    4      1          0
5        A    5      2          1
6        A    6      0          3
7        A    7      0          3
8        A    8      1          3
9        A    9      0          4
10       A   10      1          3
runner::sum_run() will help here. Using idx = year is optional unless some years are missing from the data, in which case it accounts for those gaps as well; that is not the case with the sample data, though. Grouping on student is added because, with your actual data, you will likely want to carry out the operation for each student.
toy <- data.frame(student = rep("A", 10),
                  year = c(1:10),
                  passed = c(0,0,0,1,2,0,0,1,0,1))
library(dplyr)
library(runner)

toy %>%
  group_by(student) %>%
  mutate(lag_passed = sum_run(x = passed,
                              idx = year,
                              k = 5,
                              lag = 1))
#> # A tibble: 10 x 4
#> # Groups:   student [1]
#>    student  year passed lag_passed
#>    <chr>   <int>  <dbl>      <dbl>
#>  1 A           1      0         NA
#>  2 A           2      0          0
#>  3 A           3      0          0
#>  4 A           4      1          0
#>  5 A           5      2          1
#>  6 A           6      0          3
#>  7 A           7      0          3
#>  8 A           8      1          3
#>  9 A           9      0          4
#> 10 A          10      1          3
Created on 2021-05-15 by the reprex package (v2.0.0)
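Note that the first row is NA rather than the 0 in the expected output, because there is no previous year to sum over. If you want a 0 there, one option is to wrap the call in coalesce():

toy %>%
  group_by(student) %>%
  mutate(lag_passed = coalesce(sum_run(x = passed, idx = year, k = 5, lag = 1), 0))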
Another rolling sum solution with zoo::rollapply:
f <- function(x) {zoo::rollapply(x, 6, sum, align = 'right', partial = TRUE) - x}
expected %>%
  group_by(student) %>%
  arrange(year) %>%
  mutate(lag_passed2 = f(passed)) %>%
  ungroup()
#    student  year passed lag_passed lag_passed2
#    <chr>   <int>  <dbl>      <dbl>       <dbl>
#  1 A           1      0          0           0
#  2 A           2      0          0           0
#  3 A           3      0          0           0
#  4 A           4      1      0           0
#  5 A           5      2          1           1
#  6 A           6      0          3           3
#  7 A           7      0          3           3
#  8 A           8      1          3           3
#  9 A           9      0          4           4
# 10 A          10      1          3           3
lag_passed2, created with the helper function, is the same as lag_passed. The idea is to calculate a sliding-window sum with a window length of 6 (allowing a partial window at the beginning via partial = TRUE and align = 'right'), then subtract the passed value of the current year.
Note: the helper function f can be replaced by a simpler one that specifies the window using offsets and the default right alignment, as pointed out by @G. Grothendieck:
f <- function(x) rollapplyr(x, list(-seq(5)), sum, partial = TRUE, fill = 0)
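If you'd rather avoid extra packages altogether, the same window can be computed from two cumulative sums. A minimal sketch, assuming one row per year with no gaps (as in the sample data):

library(dplyr)

toy %>%
  group_by(student) %>%
  arrange(year, .by_group = TRUE) %>%
  # running total up to the previous year, minus the running total up to
  # six years back, leaves the sum of the previous five years
  mutate(lag_passed = cumsum(lag(passed, 1, default = 0)) -
                      cumsum(lag(passed, 6, default = 0)))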

Adding sequential IDs to rows in data frame

I have a dataset called Snapper_new that has 330 rows and each set of nine rows is named 1 through 9 as shown in the id column. I want each set of nine rows (1-9, 10-18, etc.) to have a unique ID (1,2, etc.). How would I do this in R?
Here is an approach with the tidyverse:
library(tidyverse)

Snapper_new <- rep(seq(1:9), 3) %>%
  enframe(name = NULL, value = "id")

Snapper_new %>%
  mutate(group_start = case_when(id == 1 ~ 1,
                                 TRUE ~ as.numeric(0))) %>%
  mutate(group_index = cumsum(group_start))
#> # A tibble: 27 x 3
#>       id group_start group_index
#>    <int>       <dbl>       <dbl>
#>  1     1           1           1
#>  2     2           0           1
#>  3     3           0           1
#>  4     4           0           1
#>  5     5           0           1
#>  6     6           0           1
#>  7     7           0           1
#>  8     8           0           1
#>  9     9           0           1
#> 10     1           1           2
#> # ... with 17 more rows
Created on 2020-11-30 by the reprex package (v0.3.0)
A base R answer:
a <- data.frame("test" = 1:330, "pokus" = 1:330)

# one label per block of nine rows
b <- unlist(lapply(1:ceiling(330/9), function(x) {replicate(9, x)}))

# trim to the actual number of rows, then bind on the new column
b <- b[1:nrow(a)]
a <- cbind(a, b)
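Both answers can be collapsed into a one-liner, since the group label is just the 0-based row index integer-divided by 9. A sketch, assuming the rows are already ordered in blocks of nine:

# integer-divide the 0-based row index by 9, then shift to start at 1
Snapper_new$group_id <- (seq_len(nrow(Snapper_new)) - 1) %/% 9 + 1

# or, equivalently, inside a dplyr pipeline:
# Snapper_new %>% mutate(group_id = (row_number() - 1) %/% 9 + 1)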

Unable to get 0.00% for cells with 0 observations

I have a dataset which looks like:
ID week status
20 0 2
20 1 2
20 2 2
20 3 2
20 4 3
I need the proportion of status by week.
So I used the code
g_young <- dat_young %>%
  group_by(week) %>%
  count(status) %>%
  mutate(dist = prop.table(n) * 100)
I get the answer all right, but the issue is that for cells where the count is 0, R does not show the percentage as 0.00.
For example:
week status n dist
0 1 1957 12.9
0 3 1301 86.4
0 5 90 0.59
In a normal situation this would not be an issue, but I need to make a graph after this, and the fact that there is no value for status 2 and 4 in the above table causes a weird step-like pattern in the graph. Any ideas how I could sort this out?
Thanks a lot. Appreciate the time and effort in helping me with a solution.
You would need to convert your status variable into a factor type and then add the argument .drop = FALSE to the count() function.
For example:
suppressMessages(library(dplyr))
dat <- tibble(week   = c(0,0,0,0,0,1,1,1,1,2),
              status = c(1,1,2,1,1,2,3,1,2,1))
dat
#> # A tibble: 10 x 2
#>     week status
#>    <dbl>  <dbl>
#>  1     0      1
#>  2     0      1
#>  3     0      2
#>  4     0      1
#>  5     0      1
#>  6     1      2
#>  7     1      3
#>  8     1      1
#>  9     1      2
#> 10     2      1
dat %>%
  mutate(status = factor(status)) %>%
  group_by(week) %>%
  count(status, .drop = FALSE) %>%
  mutate(dist = prop.table(n) * 100)
#> # A tibble: 9 x 4
#> # Groups:   week [3]
#>    week status     n  dist
#>   <dbl> <fct>  <int> <dbl>
#> 1     0 1          4    80
#> 2     0 2          1    20
#> 3     0 3          0     0
#> 4     1 1          1    25
#> 5     1 2          2    50
#> 6     1 3          1    25
#> 7     2 1          1   100
#> 8     2 2          0     0
#> 9     2 3          0     0
Created on 2020-10-12 by the reprex package (v0.3.0)
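An alternative is to make the missing week/status combinations explicit with tidyr::complete(), which also works if you prefer to keep status numeric. A sketch using the same dat as above:

library(tidyr)

dat %>%
  count(week, status) %>%
  # expand to every observed week/status combination, filling counts with 0
  complete(week, status, fill = list(n = 0)) %>%
  group_by(week) %>%
  mutate(dist = n / sum(n) * 100)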

Is there a way in R to calculate how many times it happens that if there is a 0, the subsequent number will be also a zero?

I have a binary dataset of 0s and 1s. Is there a way to calculate how many times a 0 is followed by another 0 (e.g. within a sequence like 1010100)? And how many times there are three subsequent zeros, and so on?
A zero in my data means that a bird is not on its nest at a particular time. Every three minutes, it is noted whether the bird is on its nest (1) or not (0). My goal is to find out if birds in particular areas leave their nests for longer because it takes them longer to get enough food. Longer means more subsequent zeros.
bla <- rle(c(0,1,0,0,1,0,0,0,1,0,0))
table(bla$lengths[bla$values == 0])
gives you
1 2 3
1 2 1
meaning that a sequence of 1 zero appears once, 2 zeros twice and 3 zeros again once.
You can use run-length encoding to restructure this data. Suppose your data looked like this:
set.seed(69)
df <- data.frame(time = seq(1:20), on_off = rbinom(20, 1, 0.5))
df
#> time on_off
#> 1 1 1
#> 2 2 1
#> 3 3 1
#> 4 4 1
#> 5 5 0
#> 6 6 1
#> 7 7 0
#> 8 8 1
#> 9 9 0
#> 10 10 0
#> 11 11 1
#> 12 12 0
#> 13 13 1
#> 14 14 0
#> 15 15 0
#> 16 16 0
#> 17 17 0
#> 18 18 1
#> 19 19 0
#> 20 20 1
Then you can convert it into a run-length encoded data frame like this:
RLE <- rle(df$on_off)
len <- RLE$lengths
new_df <- data.frame(time = df$time[cumsum(c(1, len))[seq_along(len)]],
                     on_off = RLE$values, duration = len)
new_df
#> time on_off duration
#> 1 1 1 4
#> 2 5 0 1
#> 3 6 1 1
#> 4 7 0 1
#> 5 8 1 1
#> 6 9 0 2
#> 7 11 1 1
#> 8 12 0 1
#> 9 13 1 1
#> 10 14 0 4
#> 11 18 1 1
#> 12 19 0 1
#> 13 20 1 1
Created on 2020-07-01 by the reprex package (v0.3.0)
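From the run-length encoding you can also summarise the off-nest bouts directly, which is what the question is ultimately after. A sketch, assuming one observation every three minutes as described:

runs <- rle(df$on_off)

# lengths of the runs of zeros = off-nest bouts
off_bouts <- runs$lengths[runs$values == 0]

table(off_bouts)     # how often each bout length occurs
mean(off_bouts) * 3  # mean absence in minutes
max(off_bouts) * 3   # longest absence in minutes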
