Automating a loop in igraph using tidygraph - r

Hello and hope all goes well.
I have edited my previous question and hope it is clearer now.
I created an igraph object and would like to run the same analysis several times, extracting some information in each iteration.
I can't share the whole data, so I am sharing just a small subset.
df_edge is as follows:
library(dplyr)
job_1 <- c(1,2,6,6,5,6,7,8,6,8,8,6,6,8)
job_2 <- c(2,4,5,8,3,1,4,6,1,7,3,2,4,5)
weight <- c(1,1,1,2,1,1,2,1,1,1,2,1,1,1)
df_edge <- tibble(job_1, job_2, weight)
df_edge %>% glimpse()
Rows: 14
Columns: 3
$ job_1 <dbl> 1, 2, 6, 6, 5, 6, 7, 8, 6, 8, 8, 6, 6, 8
$ job_2 <dbl> 2, 4, 5, 8, 3, 1, 4, 6, 1, 7, 3, 2, 4, 5
$ weight <dbl> 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1
df_node is as follows:
job_id <- c(1,2,3,4,5,6,7,8)
job_type <- c(1,2,0,0,3,1,1,1)
df_node <- tibble(job_id,job_type)
df_node %>% glimpse()
Rows: 8
Columns: 2
$ job_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8
$ job_type <dbl> 1, 2, 0, 0, 3, 1, 1, 1
Creating the igraph object:
library(igraph)
library(tidygraph)
tp_network_subset <- graph_from_data_frame(df_edge, vertices = df_node, directed = FALSE)
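As a quick sanity check (optional, not in the original post), summary() prints the graph's vertex and edge counts:
# expect an undirected, named graph with 8 vertices and 14 edges
summary(tp_network_subset)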
Summary of the job_type column in df_node:
df_node %>%
  count(job_type)
# A tibble: 4 x 2
job_type n
<dbl> <int>
1 0 2
2 1 4
3 2 1
4 3 1
What I am doing manually is the following:
### finding the job_ids that belong to the job_type==1 category
df_node %>%
  filter(job_type == 1) %>%
  select(job_id)
# A tibble: 4 x 1
job_id
<dbl>
1 1
2 6
3 7
4 8
# for instance, I picked one of them: job_id = 6
### using the job_id to create a subgraph by selecting order 1 neighbors of this job_id (6)
node_test <- make_ego_graph(tp_network_subset, order = 1, nodes = "6")
### creating a dataframe of this subgraph with isolated nodes removed
df_test <- as_tbl_graph(node_test[[1]]) %>%
  activate(nodes) %>%
  filter(!node_is_isolated()) %>%
  as_tibble()
df_test %>% glimpse()
Rows: 6
Columns: 2
$ name <chr> "1", "2", "4", "5", "6", "8"
$ job_type <dbl> 1, 2, 0, 3, 1, 1
## the subgraph size is 6, which will be an outcome of interest
### if the subgraph is zero length, I should stop here and pick another job_id from the job_type==1 category
In this example, the subgraph is not zero length, so I proceed to the next step.
### calculating the measure of interest with respect to the job_type==1 category
df_test %>%
  summarise(job_rate = nrow(df_test %>% filter(job_type == 1)) /
              nrow(df_test %>% filter(job_type %in% c(1, 2, 3))))
# 0.6
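For what it's worth, the same rate can be computed without the nested nrow() calls; a minimal equivalent using sums of logicals (not from the original post):
# equivalent to the nrow()-based version above; also returns 0.6
df_test %>%
  summarise(job_rate = sum(job_type == 1) / sum(job_type %in% c(1, 2, 3)))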
If job_rate > 0.5, I want to keep the job_rate and the rows (corresponding nodes) of the job_type==0 category of the subgraph. In this instance, job_rate was 0.6, so I am keeping the following:
df_final <- as_tbl_graph(node_test[[1]]) %>%
  activate(nodes) %>%
  filter(!node_is_isolated()) %>%
  as_tibble() %>%
  filter(job_type == 0)
# A tibble: 1 x 2
name job_type
<chr> <dbl>
1 4 0
But I need to assign their corresponding job_rate and some other related columns. So my desired outcome would be:
  name  job_type subgraph_origin_id job_rate subgraph_size no_(job_type==0)_in_subgraph no_(job_type==1)_in_subgraph no_(job_type==2)_in_subgraph no_(job_type==3)_in_subgraph
  <chr>    <dbl>
1 4            0                  6      0.6             6                            1                            3                            1                            1
So, I need to repeat this process and create subgraphs for all job_type==1 nodes. If a subgraph is not zero length and its job_rate > 0.5, I then extract all the corresponding nodes in that subgraph along with the job_rate and the other columns shown in the desired outcome.

Does this work for you?
dflst <- split(df_node, df_node$job_type)
tpe <- as.numeric(names(dflst))
out <- tibble()
for (i in seq_along(dflst)) {
  df <- dflst[[i]]
  # order-1 ego graph for every node of this job_type
  node_test_lst <- make_ego_graph(tp_network_subset, order = 1, nodes = df$job_id)
  origin_id <- df$job_id
  jtpe <- tpe[i]
  for (j in seq_along(node_test_lst)) {
    node_test <- node_test_lst[[j]]
    df_test <- as_tbl_graph(node_test) %>%
      activate(nodes) %>%
      filter(!node_is_isolated()) %>%
      as_tibble()
    # keep only subgraphs that contain a job_type == 0 node and at least
    # one node of types 1:3 (so the rate denominator is non-zero)
    if (nrow(df_test %>% filter(job_type == 0)) > 0 && any(df_test$job_type %in% 1:3)) {
      job_rate <- with(df_test, sum(job_type == jtpe) / sum(job_type %in% 1:3))
      if (job_rate > 0.5) {
        df_final <- df_test %>%
          filter(job_type == 0) %>%
          mutate(
            subgraph_origin_id = origin_id[j],
            job_rate = job_rate,
            subgraph_size = nrow(df_test)
          ) %>%
          cbind(
            # per-type counts over all four levels, even when a level is absent
            setNames(
              as.list(table(factor(df_test$job_type, levels = 0:3))),
              sprintf("no_(job_type==%s)_in_subgraph", 0:3)
            )
          )
        out <- rbind(out, df_final)
      }
    }
  }
}
which gives
> out
  name job_type subgraph_origin_id job_rate subgraph_size
1    4        0                  6     0.60             6
2    4        0                  7     1.00             3
3    3        0                  8     0.75             5
  no_(job_type==0)_in_subgraph no_(job_type==1)_in_subgraph
1                            1                            3
2                            1                            2
3                            1                            3
  no_(job_type==2)_in_subgraph no_(job_type==3)_in_subgraph
1                            1                            1
2                            0                            0
3                            0                            1
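As a side note, if you only need the job_type == 1 origins the question describes, the same logic can be written without explicit loops using purrr::map_dfr. This is an untested sketch following the loop above (out2 is just an illustrative name); it omits the per-type count columns for brevity:
library(purrr)

origin_ids <- df_node %>% filter(job_type == 1) %>% pull(job_id)

out2 <- map_dfr(origin_ids, function(id) {
  # order-1 ego graph around this origin, matched by vertex name
  ego <- make_ego_graph(tp_network_subset, order = 1, nodes = as.character(id))[[1]]
  df_test <- as_tbl_graph(ego) %>%
    activate(nodes) %>%
    filter(!node_is_isolated()) %>%
    as_tibble()
  denom <- sum(df_test$job_type %in% 1:3)
  if (nrow(df_test) == 0 || denom == 0) return(NULL)  # skip empty subgraphs
  job_rate <- sum(df_test$job_type == 1) / denom
  if (job_rate <= 0.5) return(NULL)                   # skip low-rate subgraphs
  df_test %>%
    filter(job_type == 0) %>%
    mutate(subgraph_origin_id = id,
           job_rate = job_rate,
           subgraph_size = nrow(df_test))
})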

Related

Checking if columns in dataframe are "paired"

I have a very long data frame (~10,000 rows), in which two of the columns look something like this.
A B
1 5.5
1 5.5
2 201
9 18
9 18
2 201
9 18
... ...
Just scrubbing through the data it seems that the two columns are "paired" together, but is there any way of explicitly checking this?
You want to know if value x in column A always means value y in column B?
Let's group by A and count the distinct values in B:
library(dplyr)

df <- data.frame(
  A = c(1, 1, 2, 9, 9, 2, 9),
  B = c(5.5, 5.5, 201, 18, 18, 201, 18)
)
df %>%
  group_by(A) %>%
  distinct(B) %>%
  summarize(n_unique = n())
# A tibble: 3 x 2
A n_unique
<dbl> <int>
1 1 1
2 2 1
3 9 1
If we now alter df so that the pairing no longer holds:
df <- data.frame(
  A = c(1, 1, 2, 9, 9, 2, 9),
  B = c(5.5, 5.4, 201, 18, 18, 201, 18)
)
df %>%
  group_by(A) %>%
  distinct(B) %>%
  summarize(n_unique = n())
# A tibble: 3 x 2
A n_unique
<dbl> <int>
1 1 2
2 2 1
3 9 1
Observe the increased count for group 1. Since you have more than 10,000 rows, what remains is to check whether at least one group has n_unique > 1, for instance with filter(n_unique > 1).
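Putting it together, a small sketch (using n_distinct(), which counts distinct values directly) that returns TRUE only when the columns are perfectly paired:
unpaired <- df %>%
  group_by(A) %>%
  summarize(n_unique = n_distinct(B)) %>%
  filter(n_unique > 1)

nrow(unpaired) == 0  # TRUE means every A value maps to exactly one B value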
If you run this, you will see how many unique values of B there are for each value of A:
tapply(df$B, df$A, function(x) length(unique(x)))
If the max of this vector is 1, then no value of A has more than one corresponding value of B.
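For example, the whole check collapses to one line:
# TRUE when no value of A has more than one corresponding value of B
max(tapply(df$B, df$A, function(x) length(unique(x)))) == 1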

slice lowest positive value in R

I have a dataset looking like this:
df <- data.frame(ID=c(1, 1, 1, 2, 2, 3), days=c(100, 10, -8, -5, 12, 10))
Now I only want to take the lowest positive value of "days" so that my output would look like this:
new <- data.frame(ID=c(1, 2, 3), days=c(10, 12, 10))
I have thought about this:
df%>%
group_by(ID)%>%
slice_min(days)
But of course this will return the lowest number even if it is negative. What can I do to only get the lowest positive values?
Preferably using dplyr.
Thanks so much!
Filtering to only positive values of days first should do it.
df <- data.frame(ID=c(1, 1, 1, 2, 2, 3), days=c(100, 10, -8, -5, 12, 10))
library(dplyr)
df %>%
  group_by(ID) %>%
  filter(days > 0) %>%
  slice_min(days)
#> # A tibble: 3 x 2
#> # Groups: ID [3]
#> ID days
#> <dbl> <dbl>
#> 1 1 10
#> 2 2 12
#> 3 3 10
You can use aggregate()
aggregate(days ~ ID, df, function(x) {
  min(x[x > 0])
})
# ID days
# 1 1 10
# 2 2 12
# 3 3 10
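One caveat: if some ID had no positive days at all, min(x[x > 0]) would return Inf with a warning. A guarded variant (a sketch; the all-negative ID is hypothetical here):
aggregate(days ~ ID, df, function(x) {
  pos <- x[x > 0]
  if (length(pos) == 0) NA_real_ else min(pos)  # NA instead of Inf for empty groups
})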

How do I create a frequency table with empty categories in R?

I have a table with responses to multiple items in a survey.
(e.g. 1 = disagree and 7 agree)
var1 <- c(2, 2, 4, 1, 5, 3, 4, 6, 7, 7, 6)
var2 <- c(3, 4, 5, 1, 1, 2, 6, 6, 7, 1, 2)
var3 <- c(1, 2, 3, 1, 2, 3, 4, 5, 6, 7, 1)
df <- cbind(var1, var2, var3)
To prepare for a plot, I would like to obtain a frequency table through:
library(dplyr)
library(tibble)

frequenties <- df %>%
  apply(2, table) %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  rename(antwoord = rowname)
That works. However, if for some variables not all answer possibilities
are present I run into trouble.
In the example below, value 7 does not appear.
var3 <- c(1, 2, 3, 1, 2, 3, 4, 5, 6, 6, 1)
df <- cbind(var1, var2, var3)
If I run the same code:
frequenties <- df %>%
  apply(2, table) %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  rename(antwoord = rowname)
The error is:
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 7, 6
I get the problem: the lists differ in length. apply() does not pick up the zeros, and as a consequence the list for var3 is shorter.
I do not know how to solve this problem.
Is there a way to deal with empty categories?
Is there another way to make a frequency table. How?
One approach is to convert the variables to factors that include all 7 levels. Then the output of table will include all 7 possibilities:
library(dplyr)
library(purrr)
library(tibble)

as.data.frame(df) %>%
  mutate(across(starts_with("var"), ~ factor(., levels = 1:7))) %>%
  map_dfc(table) %>%
  rownames_to_column(var = "antwoord")
# A tibble: 7 x 4
antwoord var1 var2 var3
<chr> <table> <table> <table>
1 1 1 3 3
2 2 2 2 2
3 3 1 1 2
4 4 2 1 1
5 5 1 1 1
6 6 2 2 2
7 7 2 1 0
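A small detail: because each column comes from table(), the var columns keep the <table> class, as the printout shows. If plain integers are needed for plotting, a quick conversion (a sketch, assuming the pipeline above was assigned to frequenties as in the question):
# strip the table class so var1:var3 become ordinary integer columns
frequenties %>%
  mutate(across(starts_with("var"), as.integer))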
An alternative approach would be to pivot the data using tidyr::pivot_longer and then use dplyr::tally:
library(tidyr)

as.data.frame(df) %>%
  pivot_longer(cols = everything(), values_to = "antwoord") %>%
  group_by(name, antwoord) %>%
  tally() %>%
  pivot_wider(names_from = "name", values_from = n, values_fill = 0)
# A tibble: 7 x 4
antwoord var1 var2 var3
<dbl> <int> <int> <int>
1 1 1 3 3
2 2 2 2 2
3 3 1 1 2
4 4 2 1 1
5 5 1 1 1
6 6 2 2 2
7 7 2 1 0
Another option is to pivot to long format with pivot_longer, use count and reshape back to 'wide' with pivot_wider
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = everything(), values_to = 'antwoord') %>%
  count(name, antwoord) %>%
  # values_fill = 0 turns the absent category into an explicit zero rather than NA
  pivot_wider(names_from = name, values_from = n, values_fill = 0)
Data
df <- data.frame(var1, var2, var3)

Sum rows with specific criteria in r

I have data like this:
function person
1 hr 1
2 sls 5
3 mktg 3
4 qlt 7
5 rev 5
I want to combine the rows where the "function" column is "sls" or "mktg" into a single row containing the sum of their person values, using R.
The desired output is:
Person function
1 1 hr
2 8 sls & mktg
3 7 qlt
4 5 rev
A base R solution:
merg <- c("sls", "mktg")
dat$func[dat$func %in% merg] <- paste(merg, collapse = " & ")
aggregate(person ~ func, dat, sum)
func person
1 hr 1
2 qlt 7
3 rev 5
4 sls & mktg 8
Data
dat <- data.frame(
  func = c("hr", "sls", "mktg", "qlt", "rev"),
  person = c(1, 5, 3, 7, 5),
  stringsAsFactors = FALSE
)
Note that this assumes dat$func is a character; if it is not, first convert it with as.character().
library(dplyr)
dat <- data.frame(func = c("hr", "sls", "mktg", "qlt", "rev"),
                  person = c(1, 5, 3, 7, 5))

dat %>%
  mutate(func = func %>% as.factor() %>% as.character(),
         func = ifelse(func %in% c("sls", "mktg"), "sls & mktg", func)) %>%
  group_by(func) %>%
  summarize(Person = sum(person))
returns
# A tibble: 4 x 2
func Person
<chr> <dbl>
1 hr 1
2 qlt 7
3 rev 5
4 sls & mktg 8
Another approach with dplyr:
Code:
dfr %>%
  group_by(Function = sub("sls|mktg", "sls & mktg", functn)) %>%
  summarise(Person = sum(person))
Output:
# A tibble: 4 x 2
Function Person
<chr> <dbl>
1 hr 1.
2 qlt 7.
3 rev 5.
4 sls & mktg 8.
Data (works with stringsAsFactors = TRUE or FALSE):
dfr <- data.frame(
  functn = c("hr", "sls", "mktg", "qlt", "rev"),
  person = c(1, 5, 3, 7, 5)
)

Pass a vector with names to mutate to create multiple new columns

I'm trying to recode answers using a vector that contains the correct answers. I made a for loop that creates a new column (with the coded answer) on each iteration, using a vector with the intended names for the new columns.
However, it seems that mutate() does not accept a vector of names. I've tried several different vectors and some paste0() combinations, but nothing seems to work.
Here is my reproducible code:
library(dplyr)
library(tibble)
correct = c(4, 5, 2, 2, 2, 3, 3, 5, 4, 5, 2, 1, 3, 4, 2, 2, 2, 4, 3, 1, 1, 5, 4, 1, 3, 2)
sub1 = c(3, 5, 1, 5, 4, 3, 2, 5, 4, 3, 4, 4, 4, 1, 5, 1, 4, 3, 3, 4, 3, 2, 4, 2, 3, 4)
df = t(data.frame(sub1))
colnames(df) = paste0("P", 1:26)
new_names = paste0("P", 1:26, "_coded")
for(i in 1:26){
  df = as.tibble(df) %>%
    mutate(new_names = case_when(.[i] == correct[i] ~ 1,
                                 .[i] != correct[i] ~ 0,
                                 T ~ 9999999))
  print(df) # to know what's going on
}
Also, I know that .dots can accept names in a vector (I think), but I don't quite understand how to use it with case_when() inside mutate().
Other ways to create the new columns with the recoded values are welcome too.
UPDATE:
My expected output would be the original data frame with 26 new columns, P1_COD:P26_COD with possible values 1 (if correct) and 0 (if incorrect).
Something like this (I just created four columns with 1s and 0s as an example).
df %>%
mutate(P1_COD = 1,
P2_COD = 0,
P3_COD = 1,
P4_COD = 1)
The data is not in a format that dplyr handles best. I would suggest restructuring it to long ("longitudinal") format; then the case_when() becomes trivial and no for loop is required.
See the tidyr documentation on data formats at tidyverse.org.
Here is an example of the long format including your sample data. I also added a couple of other subjects with random answers.
library(tidyverse)
responses <- data_frame(
  subject = rep(1:3, each = 26),
  qNum = rep(1:26, 3),
  response = c(sub1,
               sample(5, 26, replace = T),
               sample(5, 26, replace = T)))
The answers can be created and then merged:
answers <- data_frame(
  qNum = 1:26,
  answer = correct)
df <- left_join(responses, answers)
Next, score the answers using dplyr::case_when:
df <- df %>% mutate(score = case_when(response == answer ~ 1,
                                      TRUE ~ 0))
Note: the TRUE ~ 0 may be confusing at first. It specifies what to do with the remaining values when the first condition is FALSE. The resulting df/tibble:
# A tibble: 78 x 5
subject qNum response answer score
<dbl> <int> <dbl> <dbl> <dbl>
1 1 1 3 4 0
2 1 2 5 5 1
3 1 3 1 2 0
4 1 4 5 2 0
5 1 5 4 2 0
6 1 6 3 3 1
7 1 7 2 3 0
8 1 8 5 5 1
9 1 9 4 4 1
10 1 10 3 5 0
# ... with 68 more rows
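As an aside, since response == answer is already logical, an equivalent one-liner without case_when() would be:
# TRUE/FALSE coerced to 1/0; same values as the case_when() above
df <- df %>% mutate(score = as.integer(response == answer))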
If you want to convert this to "wide" format, use tidyr::spread:
df %>%
  select(-response, -answer) %>%
  spread(qNum, score, sep = ".")
# A tibble: 3 x 27
subject qNum.1 qNum.2 qNum.3 qNum.4 qNum.5 qNum.6 qNum.7 qNum.8 qNum.9 qNum.10
* <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 1 0 0 0 1 0 1 1 0
2 2 0 0 0 0 1 0 0 0 0 0
3 3 0 0 0 0 1 0 0 0 0 0
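Note that spread() is superseded in current tidyr; an equivalent with pivot_wider (a sketch, assuming tidyr >= 1.0) would be:
df %>%
  select(-response, -answer) %>%
  # names_prefix reproduces the qNum.1, qNum.2, ... column names from spread()
  pivot_wider(names_from = qNum, values_from = score, names_prefix = "qNum.")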
