Finding indirect nodes for every edge (in R)

I have information on groups of physicians working together in given hospitals. A physician can work in more than one hospital at the same time. I would like to write code that outputs all indirect colleagues of a given physician working in a given hospital. For instance, if I work in a given hospital with another physician who also works in another hospital, I would like to know which physicians my colleague works with in that other hospital.
Consider a simple example of three hospitals (1, 2, 3) and five physicians (A, B, C, D, E). Physicians A, B and C work together in hospital 1. Physicians A, B and D work together in hospital 2. Physicians B and E work together in hospital 3.
For each physician working in a given hospital I would like information on their indirect colleagues through each of their direct colleagues. For example, physician A has one indirect colleague through physician B in hospital 1: this is physician E in hospital 3. On the other hand, physician B does not have any indirect colleague through physician A in hospital 1. Physician C has two indirect colleagues through physician B in hospital 1: they are physician D in hospital 2 and physician E in hospital 3. And so on.
Below is the object that describes the networks of physicians in all hospitals:
library(dplyr)  # provides tibble(), %>% and arrange()
edges <- tibble(hosp = c("1", "1", "1", "1", "1", "1", "2", "2", "2", "2", "2", "2", "3", "3"),
                from = c("A", "A", "B", "B", "C", "C", "A", "A", "B", "B", "D", "D", "B", "E"),
                to = c("C", "B", "C", "A", "B", "A", "D", "B", "A", "D", "A", "B", "E", "B")) %>% arrange(hosp, from, to)
I would like code that produces the following output:
output <- tibble(hosp = c("1", "1", "1", "1", "1", "1", "1", "2", "2", "2", "2", "2", "2", "2", "3", "3", "3", "3", "3"),
from = c("A", "A", "B", "B", "C", "C", "C", "A", "A", "B", "B", "D", "D", "D", "B", "E", "E", "E", "E"),
to = c("C", "B", "C", "A", "B", "A", "B", "D", "B", "A", "D", "A", "B", "B", "E", "B", "B", "B", "B"),
hosp_ind = c("" , "3", "" , "" , "2", "2", "3", "" , "3", "" , "" , "1", "1", "3", "" , "1", "1", "2", "2"),
to_ind = c("" , "E", "" , "" , "D", "D", "E", "" , "E", "" , "" , "C", "C", "E", "" , "A", "C", "A", "D")) %>% arrange(hosp, from, to)

Here is one option using igraph + data.table
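library(igraph)
library(data.table)
# graph_from_data_frame() reads the first two columns (hosp, from) as the edge list,
# so g links each hospital to the physicians who work there
g <- simplify(graph_from_data_frame(edges, directed = FALSE))
res <- setDT(edges)[
  ,
  c(.SD, {
    # vertices exactly two steps from `to` but not two steps from `from`, excluding `from`
    to_ind <- setdiff(
      do.call(
        setdiff,
        Map(names, ego(g, 2, c(to, from), mindist = 2))
      ), from
    )
    if (!length(to_ind)) {
      hosp_ind <- to_ind <- NA_character_
    } else {
      # hospitals in which each indirect colleague works
      hosp_ind <- lapply(to_ind, function(v) names(neighbors(g, v)))
    }
    data.table(
      hosp_ind = unlist(hosp_ind),
      to_ind = rep(to_ind, lengths(hosp_ind))
    )
  }),
  .(id = seq(nrow(edges)))
][, id := NULL][]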
and you will obtain
> res
hosp from to hosp_ind to_ind
1: 1 A B 3 E
2: 1 A C <NA> <NA>
3: 1 B A <NA> <NA>
4: 1 B C <NA> <NA>
5: 1 C A 2 D
6: 1 C B 2 D
7: 1 C B 3 E
8: 2 A B 3 E
9: 2 A D <NA> <NA>
10: 2 B A <NA> <NA>
11: 2 B D <NA> <NA>
12: 2 D A 1 C
13: 2 D B 1 C
14: 2 D B 3 E
15: 3 B E <NA> <NA>
16: 3 E B 1 A
17: 3 E B 2 A
18: 3 E B 1 C
19: 3 E B 2 D
Also, running plot(g) draws the graph of hospitals and physicians.
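For comparison, here is a minimal base R sketch of the same idea, written as a two-hop self-join rather than with igraph. It is not the answer's method: the helper names (hop2, paths, direct, indirect) are made up for illustration, and it assumes an indirect colleague is anyone reachable through a direct colleague who is neither the physician themselves nor one of their direct colleagues in any hospital. Unlike the output above, edges without indirect colleagues are simply dropped instead of being kept with NA.

```r
edges <- data.frame(
  hosp = c("1", "1", "1", "1", "1", "1", "2", "2", "2", "2", "2", "2", "3", "3"),
  from = c("A", "A", "B", "B", "C", "C", "A", "A", "B", "B", "D", "D", "B", "E"),
  to   = c("C", "B", "C", "A", "B", "A", "D", "B", "A", "D", "A", "B", "E", "B"),
  stringsAsFactors = FALSE
)

# Second hop: every edge that starts at the direct colleague `to`
hop2 <- data.frame(to = edges$from, hosp_ind = edges$hosp, to_ind = edges$to,
                   stringsAsFactors = FALSE)
paths <- merge(edges, hop2, by = "to")

# All direct (from, to) pairs, regardless of hospital
direct <- unique(paste(edges$from, edges$to))

keep <- paths$hosp_ind != paths$hosp &                        # reached via another hospital
  paths$to_ind != paths$from &                                # not the physician themselves
  !(paste(paths$from, paths$to_ind) %in% direct)              # not already a direct colleague

indirect <- paths[keep, c("hosp", "from", "to", "hosp_ind", "to_ind")]
indirect <- indirect[order(indirect$hosp, indirect$from, indirect$to,
                           indirect$hosp_ind, indirect$to_ind), ]
rownames(indirect) <- NULL
indirect
```

On the example data this keeps the 12 edge/indirect-colleague combinations that have a non-empty hosp_ind in the desired output.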

Related

How to count occurrences of a variable each time it occurs and remove outliers in R

I have a vector. On the one hand, I want to deal with factor values that seem to be misclassified, for instance the "D" at position 7. As its surroundings are "A", it should be "A" too. I know there must be a rule, for example: if the 3 values before and after an outlier differ from it, it is converted (in this case "D" to "A"); otherwise it is removed, like the "C" at position 22.
Var = c("A", "A", "A", "A","A", "A", "D", "A", "A","A", "A", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "C", "B", "B", "C","C","C","C","C","C","C","C","C","C","D", "D","D","D","D","D","D","D", "A", "A", "A", "A","A", "A", "A", "A", "A", "A","A", "A", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "C","C","C","C","C","C","C","C", "C","C","C","C","C","C","C","C", "D","D","D","D","D")
Var= as.factor(Var)
Var2=c("1", "2", "1", "2","3", "2", "1", "2", "2","2", "2", "1", "1",
"1", "2", "1", "2","3", "2", "1", "2", "2","2", "2",
"1", "2", "1", "2","3", "2", "1", "2", "2","2", "2", "1", "1", "2","2", "2", "1", "1","2","2", "2", "1", "1", "2","2", "2", "1", "1", "2","2", "2", "1", "1","2","2", "2", "1", "1", "2","2", "2", "1", "2","2", "2", "1", "1", "2","2", "2", "1", "1", "2","2", "2", "1", "1","1",
"1","1","1","1","1")
df<- data.frame (Var, Var2)
Additionally, I want to count the occurrences of each value every time it occurs. So I do not want to count the occurrences in the whole vector, but to get a list like this, ideally with the corrected values.
# Var Occurence
#1 A 6
#2 D 1
#3 A 4
#4 B 10
#5 C 1
#6 B 2 ...
So far I can only count the values for the whole vector, with
table (Var)
With the following code I get a column (here called Count) that restarts counting each time "Var" changes. Var is left untouched, because assigning integer counts into the factor itself would produce NAs.
df$Count <- sequence(rle(as.character(df$Var))$lengths)
This may be easier with data.table. Group by the rleid (run-length id) of 'Var' and get the count of each run (.N), then remove the outlier observations with a logical expression in i (based on the outliers reported by boxplot)
library(data.table)
setDT(df)[, .N, .(Var, grp = rleid(Var))][, grp := NULL][
!N %in% boxplot(N, plot = FALSE)$out]
-output
Var N
1: A 6
2: D 1
3: A 4
4: B 10
5: C 1
6: B 2
7: C 10
8: D 8
9: A 12
10: B 12
11: C 16
12: D 5
rleid can take multiple input columns as the first argument is variadic (...) - from ?rleid
rleid(..., prefix=NULL)
... A sequence of numeric, integer64, character or logical vectors, all of same length. For interactive use.
Therefore, if we have multiple columns, either specify the columns in rleid or use rleidv with the relevant subset of the data.frame/data.table as input
setDT(df)[, .N, .(Var, Var2, grp = rleid(Var, Var2))][,
grp := NULL][ !N %in% boxplot(N, plot = FALSE)$out]
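If data.table is not available, the per-run counts can also be sketched in base R with rle(). This is only an illustration on a shortened, made-up vector (not the full Var from the question), and run_table is a hypothetical name.

```r
# Shortened toy vector; each change of value starts a new run
Var <- c("A", "A", "A", "D", "A", "A", "B", "B")

runs <- rle(Var)
run_table <- data.frame(Var = runs$values, N = runs$lengths,
                        stringsAsFactors = FALSE)
run_table

# Same boxplot-based filter as in the data.table version
run_table[!run_table$N %in% boxplot(run_table$N, plot = FALSE)$out, ]
```

rle() works on atomic vectors, so a factor column would need as.character() first.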

How to pull the column indices when matching the rows of a dataframe and a vector

Say I have a dataframe of letters like so:
X1 X2 X3
1 G A C
2 G T C
3 G T C
4 A T G
5 A C G
And a vector like so:
ref <- c("A", "C", "C", "A", "G")
Going row-wise, how do I pull the column indices of the dataframe which match the vector?
So the answer should be a vector of numbers like so:
2, 3, 3, 1, 3
We can use
max.col(df1 == ref)
#[1] 2 3 3 1 3
data
df1 <- structure(list(X1 = c("G", "G", "G", "A", "A"), X2 = c("A", "T",
"T", "T", "C"), X3 = c("C", "C", "C", "G", "G")), class = "data.frame",
row.names = c("1",
"2", "3", "4", "5"))
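One caveat worth checking: max.col() breaks ties at random by default and still returns some column for a row with no match at all. A hedged variant that takes the first match and marks rows without any match as NA could look like this (hits and idx are illustrative names):

```r
df1 <- data.frame(X1 = c("G", "G", "G", "A", "A"),
                  X2 = c("A", "T", "T", "T", "C"),
                  X3 = c("C", "C", "C", "G", "G"),
                  stringsAsFactors = FALSE)
ref <- c("A", "C", "C", "A", "G")

# ref is recycled down each column, so row i is compared with ref[i]
hits <- df1 == ref

idx <- max.col(hits, ties.method = "first")  # first matching column per row
idx[rowSums(hits) == 0] <- NA                # rows with no match at all
idx
#[1] 2 3 3 1 3
```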

How to apply separate_rows() to all columns, passing a `sep` parameter?

Pretty straightforward: I have a data frame where the values in many columns need to be split into their own rows, based on ;s as the delimiter.
After reading a bit,
df %>%
Reduce(separate_rows_, x = colnames)
works, except that I can't pass the sep parameter (so it also separates by white spaces, commas, and other non-alphanumeric chars).
One answer proposed writing a modified version of the function that includes the parameter, but I couldn't get that working:
Reduce(f = function(y) separate_rows_(sep = ";"), x = colnames)
What am I doing wrong?
Having said that, my ideal solution would be a tidyverse solution, if it's cleaner (maybe map_dfr?); but obviously any solution is better than none :).
Here's sample data:
structure(list(q1 = c("1,2,3,4", "2,4"), q2 = c("a,b", "e,f"),
q3 = c("c,d", "g,h,z")), row.names = 1:2, class = "data.frame")
Expected output:
structure(list(q1 = c("1", "1", "1", "1", "2", "2", "2", "2",
"3", "3", "3", "3", "4", "4", "4", "4", "2", "2", "2", "2", "2",
"2", "4", "4", "4", "4", "4", "4"), q2 = c("a", "a", "b", "b",
"a", "a", "b", "b", "a", "a", "b", "b", "a", "a", "b", "b", "e",
"e", "e", "f", "f", "f", "e", "e", "e", "f", "f", "f"), q3 = c("c",
"d", "c", "d", "c", "d", "c", "d", "c", "d", "c", "d", "c", "d",
"c", "d", "g", "h", "z", "g", "h", "z", "g", "h", "z", "g", "h",
"z")), row.names = c(NA, -28L), class = "data.frame")
What I want to avoid is having to pass every column name, like so:
output <- test %>%
separate_rows(q1, sep = ",") %>%
separate_rows(q2, sep = ",") %>%
separate_rows(q3, sep = ",")
You can use purrr::reduce, which applies the given function .f to .init and the first element of .x, then applies the function to the output of that and the second element of .x, etc. until all elements of .x have been used.
Within the .f argument formula, .x is the previous output (or .init for the first run) and .y is the given element of the .x argument to reduce.
library(tidyverse)
reduce(.init = df, .x = names(df), .f = ~separate_rows(.x, .y, sep = ','))
# equiv to: reduce(.init = df, .x = names(df), .f = separate_rows, sep = ',')
As akrun notes in the comments, this can also be done in base R with the code below (same output)
Reduce(function(x, y) separate_rows(x, y, sep=","), names(df), init = df)
# q1 q2 q3
# 1 1 a c
# 2 1 a d
# 3 1 b c
# 4 1 b d
# 5 2 a c
# 6 2 a d
# 7 2 b c
# 8 2 b d
# 9 3 a c
# 10 3 a d
# 11 3 b c
# 12 3 b d
# 13 4 a c
# 14 4 a d
# 15 4 b c
# 16 4 b d
# 17 2 e g
# 18 2 e h
# 19 2 e z
# 20 2 f g
# 21 2 f h
# 22 2 f z
# 23 4 e g
# 24 4 e h
# 25 4 e z
# 26 4 f g
# 27 4 f h
# 28 4 f z
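To see the order in which the fold visits the columns, a toy base R Reduce() call with accumulate = TRUE may help. The sketch below only pastes labels together (no tidyr involved); the "sep(...)" labels are made up to stand in for a separate_rows() call.

```r
cols <- c("q1", "q2", "q3")

# Each step wraps the previous result, mimicking f(f(f(init, q1), q2), q3)
steps <- Reduce(
  f = function(acc, col) paste0("sep(", acc, ", ", col, ")"),
  x = cols,
  init = "df",
  accumulate = TRUE
)
steps
#[1] "df"
#[2] "sep(df, q1)"
#[3] "sep(sep(df, q1), q2)"
#[4] "sep(sep(sep(df, q1), q2), q3)"
```

The last element has the same nesting as the chained separate_rows() calls in the question.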

R replace values in column based on match between columns

I have two dataframes, each with the same columns. Some columns have the same values in the same order in both dataframes (X1, X2 below). Other columns have the same values, but in a different order (Y1). This is only a problem for some levels of first variables (here, the order of rows in Y1 differs for X1 == "a", but not X1 == "b"). Example:
df1 <- data.frame("X1" = c("a", "a", "a", "b", "b", "b"),
"X2" = c("1", "2", "3", "1", "2", "3"),
"Y1" = c("d", "d", "f", "g", "h", "i"))
df2 <- data.frame("X1" = c("a", "a", "a", "b", "b", "b"),
"X2" = c("1", "2", "3", "1", "2", "3"),
"Y1" = c("f", "d", "d", "g", "h", "i"))
I would like to change the values of df2$X1 and df2$X2 such that the two dataframes are matched on values of Y1.
I would like to change X1 and X2 rather than Y1 because there are many Y variables. I would like to do this only for df$X1 == "a".
The output should look like this:
df2 <- data.frame("X1" = c("a", "a", "a", "b", "b", "b"),
"X2" = c("3", "1", "2", "1", "2", "3"),
"Y1" = c("f", "d", "d", "g", "h", "i"))
What is a little tricky in your situation is that you have duplicates in the Y1 columns which correspond to different values in the X2 columns. So you will have to make these unique.
First, make sure that your Y1 columns are character vectors and not factors:
df1 <- data.frame("X1" = c("a", "a", "a", "b", "b", "b"),
"X2" = c("1", "2", "3", "1", "2", "3"),
"Y1" = c("d", "d", "f", "g", "h", "i"),
stringsAsFactors = F)
df2 <- data.frame("X1" = c("a", "a", "a", "b", "b", "b"),
"X2" = c("1", "2", "3", "1", "2", "3"),
"Y1" = c("f", "d", "d", "g", "h", "i"),
stringsAsFactors = F)
Give unique names to your Y1 duplicates:
df1$Y1uniq <- make.unique(df1$Y1)
df2$Y1uniq <- make.unique(df2$Y1)
Then you can use match() on those unique values (and remove that column once you don't need it anymore):
df1[match(df2$Y1uniq, df1$Y1uniq), ][ , 1:3]
Output:
X1 X2 Y1
3 a 3 f
1 a 1 d
2 a 2 d
4 b 1 g
5 b 2 h
6 b 3 i
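Put together, and writing the matched X1 and X2 back into df2 as the question asks, a minimal sketch could look like this. The names key1, key2 and df2_fixed are illustrative; the approach assumes that the k-th duplicate in df2 should pair with the k-th duplicate in df1, since make.unique() numbers duplicates in order of appearance.

```r
df1 <- data.frame(X1 = c("a", "a", "a", "b", "b", "b"),
                  X2 = c("1", "2", "3", "1", "2", "3"),
                  Y1 = c("d", "d", "f", "g", "h", "i"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(X1 = c("a", "a", "a", "b", "b", "b"),
                  X2 = c("1", "2", "3", "1", "2", "3"),
                  Y1 = c("f", "d", "d", "g", "h", "i"),
                  stringsAsFactors = FALSE)

# Disambiguate duplicated Y1 values: "d", "d" becomes "d", "d.1"
key1 <- make.unique(df1$Y1)
key2 <- make.unique(df2$Y1)

# Copy X1/X2 from the df1 row that carries the same (unique) Y1 key
df2_fixed <- df2
df2_fixed[c("X1", "X2")] <- df1[match(key2, key1), c("X1", "X2")]
df2_fixed
```

Keeping the keys in separate vectors avoids adding and then dropping a helper column.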

Count occurrences per entry in dataframe

I have the following kind of dataframe (this is simplified example):
id = c("1", "1", "1", "2", "3", "3", "4", "4")
bank = c("a", "b", "c", "b", "b", "c", "a", "c")
df = data.frame(id, bank)
df
id bank
1 1 a
2 1 b
3 1 c
4 2 b
5 3 b
6 3 c
7 4 a
8 4 c
In this dataframe you can see that for some ids there are multiple banks, i.e. for id==1, bank=c(a,b,c).
The information I would like to extract from this dataframe is the overlap between id's within different banks and the count.
So for example for bank a: bank a has two persons (unique ids): 1 and 4. For these persons, I want to know what other banks they have
For person 1: bank b and c
For person 4: bank c
the total amount of other banks: 3, for which, b = 1, and c = 2.
So I want to create as output a sort of overlap table as below:
bank overlap amount
a b 1
a c 2
b a 1
b c 2
c a 2
c b 2
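Took me a while to get a result, so I post it. Not as sexy as Ronak Shah's, but it gives the same counts for each pair of banks (in one direction only).
library(data.table)  # rbindlist(), data.table()
library(dplyr)       # %>%, group_by(), summarise()
id = c("1", "1", "1", "2", "3", "3", "4", "4")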
bank = c("a", "b", "c", "b", "b", "c", "a", "c")
df = data.frame(id, bank)
df$bank <- as.character(df$bank)
resultlist <- list()
dflist <- split(df, df$id)
for(i in 1:length(dflist)) {
if(nrow(dflist[[i]]) < 2) {
resultlist[[i]] <- data.frame(matrix(nrow = 0, ncol = 2))
} else {
resultlist[[i]] <- as.data.frame(t(combn(dflist[[i]]$bank, 2)))
}
}
result <- setNames(data.table(rbindlist(resultlist)), c("bank", "overlap"))
result %>%
group_by(bank, overlap) %>%
summarise(amount = n())
bank overlap amount
<fct> <fct> <int>
1 a b 1
2 a c 2
3 b c 2
We may use data.table:
df = data.frame(id = c("1", "1", "1", "2", "3", "3", "4", "4"),
bank = c("a", "b", "c", "b", "b", "c", "a", "c"))
library(data.table)
setDT(df)[, .(bank = rep(bank, (.N-1L):0L),
overlap = bank[(sequence((.N-1L):1L) + rep(1:(.N-1L), (.N-1L):1))]),
by=id][,
.N, by=.(bank, overlap)]
#> bank overlap N
#> 1: a b 1
#> 2: a c 2
#> 3: b c 2
#> 4: <NA> b 1
Created on 2019-07-01 by the reprex package (v0.3.0)
Please note that you have b for id==2 which is not overlapping with other values. If you don't want that in the final product, just apply na.omit() on the output.
An option would be full_join
library(dplyr)
full_join(df, df, by = "id") %>%
filter(bank.x != bank.y) %>%
dplyr::count(bank.x, bank.y) %>%
select(bank = bank.x, overlap = bank.y, amount = n)
# A tibble: 6 x 3
# bank overlap amount
# <fct> <fct> <int>
#1 a b 1
#2 a c 2
#3 b a 1
#4 b c 2
#5 c a 2
#6 c b 2
Do you need to cover each pair of banks in both directions? In this case a -> b is the same as b -> a. We can use combn to create combinations of the unique banks taken 2 at a time, and count the ids common to both banks in each combination.
as.data.frame(t(combn(unique(df$bank), 2, function(x)
c(x, with(df, length(intersect(id[bank == x[1]], id[bank == x[2]])))))))
# V1 V2 V3
#1 a b 1
#2 a c 2
#3 b c 2
data
id = c("1", "1", "1", "2", "3", "3", "4", "4")
bank = c("a", "b", "c", "b", "b", "c", "a", "c")
df = data.frame(id, bank, stringsAsFactors = FALSE)
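The self-join idea can also be sketched in base R with merge() and table(), for readers who prefer to avoid extra packages. This is an illustration under the same data as above, not one of the posted answers; pairs and overlap are made-up names.

```r
df <- data.frame(id = c("1", "1", "1", "2", "3", "3", "4", "4"),
                 bank = c("a", "b", "c", "b", "b", "c", "a", "c"),
                 stringsAsFactors = FALSE)

# Pair every bank of an id with every bank of the same id
pairs <- merge(df, df, by = "id")            # columns: id, bank.x, bank.y
pairs <- pairs[pairs$bank.x != pairs$bank.y, ]

# Count each (bank, overlap) combination and drop the empty ones
overlap <- as.data.frame(table(bank = pairs$bank.x, overlap = pairs$bank.y),
                         responseName = "amount", stringsAsFactors = FALSE)
overlap <- overlap[overlap$amount > 0, ]
overlap <- overlap[order(overlap$bank, overlap$overlap), ]
rownames(overlap) <- NULL
overlap
#  bank overlap amount
#1    a       b      1
#2    a       c      2
#3    b       a      1
#4    b       c      2
#5    c       a      2
#6    c       b      2
```

An id with a single bank (id 2 here) only pairs with itself, so the filter removes it without any special handling.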
