Pretty straightforward: I have a data frame where the values in many columns need to be split into their own rows, using ; as the delimiter.
After reading a bit,
df %>%
Reduce(separate_rows_, x = colnames)
works, except that I can't pass the sep parameter (so it also splits on whitespace, commas, and other non-alphanumeric characters).
One answer proposed writing a modified version of the function that includes the parameter, but I couldn't get that working:
Reduce(f = function(y) separate_rows_(sep = ";"), x = colnames)
What am I doing wrong?
Having said that, my ideal solution would be a tidyverse solution, if it's cleaner (maybe map_dfr?); but obviously any solution is better than none :).
Here's sample data:
structure(list(q1 = c("1,2,3,4", "2,4"), q2 = c("a,b", "e,f"),
q3 = c("c,d", "g,h,z")), row.names = 1:2, class = "data.frame")
Expected output:
structure(list(q1 = c("1", "1", "1", "1", "2", "2", "2", "2",
"3", "3", "3", "3", "4", "4", "4", "4", "2", "2", "2", "2", "2",
"2", "4", "4", "4", "4", "4", "4"), q2 = c("a", "a", "b", "b",
"a", "a", "b", "b", "a", "a", "b", "b", "a", "a", "b", "b", "e",
"e", "e", "f", "f", "f", "e", "e", "e", "f", "f", "f"), q3 = c("c",
"d", "c", "d", "c", "d", "c", "d", "c", "d", "c", "d", "c", "d",
"c", "d", "g", "h", "z", "g", "h", "z", "g", "h", "z", "g", "h",
"z")), row.names = c(NA, -28L), class = "data.frame")
What I want to avoid is having to pass every column name like so:
output <- test %>%
separate_rows(q1, sep = ",") %>%
separate_rows(q2, sep = ",") %>%
separate_rows(q3, sep = ",")
You can use purrr::reduce, which applies the given function .f to .init and the first element of .x, then applies the function to the output of that and the second element of .x, etc. until all elements of .x have been used.
Within the .f argument formula, .x is the previous output (or .init for the first run) and .y is the given element of the .x argument to reduce.
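To make the folding order concrete, here is a small sketch with made-up strings (the input and the names letters_in, final and steps are illustrative, not from the question). purrr::accumulate takes the same arguments as reduce but keeps every intermediate value, so it shows what .x holds at each step.

```r
library(purrr)

# Illustrative input: fold three letters onto a starting value.
letters_in <- c("a", "b", "c")

# reduce() returns only the final value.
# Inside the formula, .x is the running result and .y is the next element.
final <- reduce(letters_in, ~ paste0(.x, .y), .init = "start")

# accumulate() keeps each intermediate result, beginning with .init.
steps <- accumulate(letters_in, ~ paste0(.x, .y), .init = "start")

print(final)  # "startabc"
print(steps)  # "start" "starta" "startab" "startabc"
```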
library(tidyverse)
reduce(.init = df, .x = names(df), .f = ~separate_rows(.x, .y, sep = ','))
# equiv to: reduce(.init = df, .x = names(df), .f = separate_rows, sep = ',')
As akrun notes in the comments, the same can be done with base R's Reduce in place of purrr::reduce (same output):
Reduce(function(x, y) separate_rows(x, y, sep=","), names(df), init = df)
# q1 q2 q3
# 1 1 a c
# 2 1 a d
# 3 1 b c
# 4 1 b d
# 5 2 a c
# 6 2 a d
# 7 2 b c
# 8 2 b d
# 9 3 a c
# 10 3 a d
# 11 3 b c
# 12 3 b d
# 13 4 a c
# 14 4 a d
# 15 4 b c
# 16 4 b d
# 17 2 e g
# 18 2 e h
# 19 2 e z
# 20 2 f g
# 21 2 f h
# 22 2 f z
# 23 4 e g
# 24 4 e h
# 25 4 e z
# 26 4 f g
# 27 4 f h
# 28 4 f z
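A variant of the same idea, as a sketch only: recent tidyselect versions may warn when a plain character variable is used to select a column, and wrapping the name in all_of() states the intent explicitly. The helper name split_all_cols is made up for illustration; the data is the sample from the question.

```r
library(tidyr)
library(purrr)

df <- data.frame(
  q1 = c("1,2,3,4", "2,4"),
  q2 = c("a,b", "e,f"),
  q3 = c("c,d", "g,h,z"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split every column of `data` on `sep`, one column at a time.
# `acc` is the data frame built so far, `col` is the next column name.
split_all_cols <- function(data, sep = ",") {
  reduce(
    names(data),
    function(acc, col) separate_rows(acc, tidyselect::all_of(col), sep = sep),
    .init = data
  )
}

out <- split_all_cols(df, sep = ",")
print(dim(out))  # 28 rows, 3 columns
```

This also suggests why the attempt in the question fails: the function handed to Reduce there takes a single argument and never receives the data frame, whereas Reduce calls f with two arguments (the running result and the next element) and needs the data frame as the starting value.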
I have information on groups of physicians working together in given hospitals. A physician can work in more than one hospital at the same time. I would like to write code that outputs all indirect colleagues of a given physician working in a given hospital. For instance, if I work in a given hospital with another physician who also works in another hospital, I would like to know which physicians my colleague works with in that other hospital.
Consider a simple example of three hospitals (1, 2, 3) and five physicians (A, B, C, D, E). Physicians A, B and C work together in hospital 1. Physicians A, B and D work together in hospital 2. Physicians B and E work together in hospital 3.
For each physician working in a given hospital, I would like to know their indirect colleagues through each of their direct colleagues. For example, physician A has one indirect colleague through physician B in hospital 1: physician E in hospital 3. On the other hand, physician B does not have any indirect colleague through physician A in hospital 1. Physician C has two indirect colleagues through physician B in hospital 1: physician D in hospital 2 and physician E in hospital 3. And so on.
Below is the object that describes the networks of physicians in all hospitals:
edges <- tibble(hosp = c("1", "1", "1", "1", "1", "1", "2", "2", "2", "2", "2", "2", "3", "3"),
from = c("A", "A", "B", "B", "C", "C", "A", "A", "B", "B", "D", "D", "B", "E"),
to = c("C", "B", "C", "A", "B", "A", "D", "B", "A", "D", "A", "B", "E", "B")) %>% arrange(hosp, from, to)
I would like code that produces the following output:
output <- tibble(hosp = c("1", "1", "1", "1", "1", "1", "1", "2", "2", "2", "2", "2", "2", "2", "3", "3", "3", "3", "3"),
from = c("A", "A", "B", "B", "C", "C", "C", "A", "A", "B", "B", "D", "D", "D", "B", "E", "E", "E", "E"),
to = c("C", "B", "C", "A", "B", "A", "B", "D", "B", "A", "D", "A", "B", "B", "E", "B", "B", "B", "B"),
hosp_ind = c("" , "3", "" , "" , "2", "2", "3", "" , "3", "" , "" , "1", "1", "3", "" , "1", "1", "2", "2"),
to_ind = c("" , "E", "" , "" , "D", "D", "E", "" , "E", "" , "" , "C", "C", "E", "" , "A", "C", "A", "D")) %>% arrange(hosp, from, to)
Here is one option using igraph + data.table
library(igraph)
library(data.table)
g <- simplify(graph_from_data_frame(edges, directed = FALSE))
res <- setDT(edges)[
,
c(.SD, {
to_ind <- setdiff(
do.call(
setdiff,
Map(names, ego(g, 2, c(to, from), mindist = 2))
), from
)
if (!length(to_ind)) {
hosp_ind <- to_ind <- NA_character_
} else {
hosp_ind <- lapply(to_ind, function(v) names(neighbors(g, v)))
}
data.table(
hosp_ind = unlist(hosp_ind),
to_ind = rep(to_ind, lengths(hosp_ind))
)
}),
.(id = seq(nrow(edges)))
][, id := NULL][]
and you will obtain
> res
hosp from to hosp_ind to_ind
1: 1 A B 3 E
2: 1 A C <NA> <NA>
3: 1 B A <NA> <NA>
4: 1 B C <NA> <NA>
5: 1 C A 2 D
6: 1 C B 2 D
7: 1 C B 3 E
8: 2 A B 3 E
9: 2 A D <NA> <NA>
10: 2 B A <NA> <NA>
11: 2 B D <NA> <NA>
12: 2 D A 1 C
13: 2 D B 1 C
14: 2 D B 3 E
15: 3 B E <NA> <NA>
16: 3 E B 1 A
17: 3 E B 2 A
18: 3 E B 1 C
19: 3 E B 2 D
Also, you can run plot(g) to see the graph.
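For comparison, here is a base R sketch that avoids igraph by joining the edge list to itself. It rests on an assumed reading of the question: an indirect colleague is someone the direct colleague works with in a different hospital, who is neither the starting physician nor already one of their direct colleagues. The names second_hop, cand and indirect are made up for illustration. Unlike the requested output, it lists only the edges that have at least one indirect colleague.

```r
# Same edge list as in the question, as a plain data frame.
edges <- data.frame(
  hosp = c("1", "1", "1", "1", "1", "1", "2", "2", "2", "2", "2", "2", "3", "3"),
  from = c("A", "A", "B", "B", "C", "C", "A", "A", "B", "B", "D", "D", "B", "E"),
  to   = c("C", "B", "C", "A", "B", "A", "D", "B", "A", "D", "A", "B", "E", "B"),
  stringsAsFactors = FALSE
)

# Second hop: the direct colleague (`to`) and whom they work with, and where.
second_hop <- setNames(edges, c("hosp_ind", "to", "to_ind"))
cand <- merge(edges, second_hop, by = "to")

# Assumed rules: a different hospital, not the starting physician,
# and not already a direct colleague of the starting physician.
direct <- paste(edges$from, edges$to)
keep <- cand$hosp_ind != cand$hosp &
  cand$to_ind != cand$from &
  !(paste(cand$from, cand$to_ind) %in% direct)

indirect <- cand[keep, c("hosp", "from", "to", "hosp_ind", "to_ind")]
indirect <- indirect[order(indirect$hosp, indirect$from, indirect$to,
                           indirect$hosp_ind, indirect$to_ind), ]
rownames(indirect) <- NULL
print(indirect)
```

On this example the sketch yields 12 rows, the same (hosp, from, to, hosp_ind, to_ind) combinations as the non-empty rows of the requested output; whether the assumed rules hold for other data would need checking.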
Say I have a dataframe of letters like so:
X1 X2 X3
1 G A C
2 G T C
3 G T C
4 A T G
5 A C G
And a vector like so:
ref <- c("A", "C", "C", "A", "G")
Going row-wise, how do I pull the column indices of the dataframe which match the vector?
So the answer should be a vector of numbers like so:
2, 3, 3, 1, 3
We can use
max.col(df1 == ref)
#[1] 2 3 3 1 3
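One caveat, shown as a hedged sketch: max.col breaks ties at random by default, and a row with no match at all is a tie between zeros, so it would still return some column. If a row could match in several columns or in none, something like the following may be safer (hits and idx are illustrative names; the data is the example above).

```r
df1 <- data.frame(
  X1 = c("G", "G", "G", "A", "A"),
  X2 = c("A", "T", "T", "T", "C"),
  X3 = c("C", "C", "C", "G", "G"),
  stringsAsFactors = FALSE
)
ref <- c("A", "C", "C", "A", "G")

# Logical matrix; ref is recycled down the rows, so ref[i] is compared with row i.
hits <- df1 == ref

# Take the leftmost match in each row instead of a random one.
idx <- max.col(hits, ties.method = "first")

# Rows without any match get NA rather than an arbitrary column.
idx[rowSums(hits) == 0] <- NA

print(idx)  # 2 3 3 1 3
```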
data
df1 <- structure(list(X1 = c("G", "G", "G", "A", "A"), X2 = c("A", "T",
"T", "T", "C"), X3 = c("C", "C", "C", "G", "G")), class = "data.frame",
row.names = c("1",
"2", "3", "4", "5"))
I have the following kind of dataframe (this is simplified example):
id = c("1", "1", "1", "2", "3", "3", "4", "4")
bank = c("a", "b", "c", "b", "b", "c", "a", "c")
df = data.frame(id, bank)
df
id bank
1 1 a
2 1 b
3 1 c
4 2 b
5 3 b
6 3 c
7 4 a
8 4 c
In this dataframe you can see that some ids have multiple banks, e.g. for id==1, bank=c(a,b,c).
What I would like to extract from this dataframe is, for each bank, which other banks its ids also use, and how often.
So for example for bank a: bank a has two persons (unique ids): 1 and 4. For these persons, I want to know what other banks they have:
For person 1: bank b and c
For person 4: bank c
The total number of other banks is 3, of which b = 1 and c = 2.
So I want to create as output a sort of overlap table as below:
bank overlap amount
a b 1
a c 2
b a 1
b c 2
c a 2
c b 2
It took me a while to get a result, so I'm posting it. Not as sexy as Ronak Shah's, but the same result.
library(data.table)  # rbindlist()
library(dplyr)       # %>%, group_by(), summarise()
id = c("1", "1", "1", "2", "3", "3", "4", "4")
bank = c("a", "b", "c", "b", "b", "c", "a", "c")
df = data.frame(id, bank)
df$bank <- as.character(df$bank)
resultlist <- list()
dflist <- split(df, df$id)
# for each id, list every pair of banks that id uses
for(i in seq_along(dflist)) {
  if(nrow(dflist[[i]]) < 2) {
    # an id with a single bank contributes no pairs
    resultlist[[i]] <- data.frame(matrix(nrow = 0, ncol = 2))
  } else {
    resultlist[[i]] <- as.data.frame(t(combn(dflist[[i]]$bank, 2)))
  }
}
result <- setNames(rbindlist(resultlist), c("bank", "overlap"))
# count how often each pair occurs
result %>%
  group_by(bank, overlap) %>%
  summarise(amount = n())
bank overlap amount
<fct> <fct> <int>
1 a b 1
2 a c 2
3 b c 2
We may use data.table:
df = data.frame(id = c("1", "1", "1", "2", "3", "3", "4", "4"),
bank = c("a", "b", "c", "b", "b", "c", "a", "c"))
library(data.table)
setDT(df)[, .(bank = rep(bank, (.N-1L):0L),
overlap = bank[(sequence((.N-1L):1L) + rep(1:(.N-1L), (.N-1L):1))]),
by=id][,
.N, by=.(bank, overlap)]
#> bank overlap N
#> 1: a b 1
#> 2: a c 2
#> 3: b c 2
#> 4: <NA> b 1
Created on 2019-07-01 by the reprex package (v0.3.0)
Please note that you have b for id==2 which is not overlapping with other values. If you don't want that in the final product, just apply na.omit() on the output.
An option would be full_join
library(dplyr)
full_join(df, df, by = "id") %>%
filter(bank.x != bank.y) %>%
dplyr::count(bank.x, bank.y) %>%
select(bank = bank.x, overlap = bank.y, amount = n)
# A tibble: 6 x 3
# bank overlap amount
# <fct> <fct> <int>
#1 a b 1
#2 a c 2
#3 b a 1
#4 b c 2
#5 c a 2
#6 c b 2
Do you need each pair of banks in both directions? a -> b is the same as b -> a in this case. We can use combn to create all combinations of unique banks taken 2 at a time, and for each combination count the ids the two banks have in common.
as.data.frame(t(combn(unique(df$bank), 2, function(x)
c(x, with(df, length(intersect(id[bank == x[1]], id[bank == x[2]])))))))
# V1 V2 V3
#1 a b 1
#2 a c 2
#3 b c 2
data
id = c("1", "1", "1", "2", "3", "3", "4", "4")
bank = c("a", "b", "c", "b", "b", "c", "a", "c")
df = data.frame(id, bank, stringsAsFactors = FALSE)
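Another way to get at the same counts, as a sketch: build an id-by-bank membership table and take its cross-product, so that cell [i, j] counts the ids that use both bank i and bank j. This assumes each id-bank pair appears at most once in the data; the names membership, shared and overlap are illustrative.

```r
id <- c("1", "1", "1", "2", "3", "3", "4", "4")
bank <- c("a", "b", "c", "b", "b", "c", "a", "c")
df <- data.frame(id, bank, stringsAsFactors = FALSE)

# Rows are ids, columns are banks; entries are 0 or 1 for this data.
membership <- table(df$id, df$bank)

# t(membership) %*% membership: number of ids shared by each pair of banks.
shared <- crossprod(membership)

# Reshape to long form and drop the diagonal (a bank paired with itself).
overlap <- as.data.frame(as.table(shared), responseName = "amount",
                         stringsAsFactors = FALSE)
names(overlap)[1:2] <- c("bank", "overlap")
overlap <- overlap[overlap$bank != overlap$overlap, ]
overlap <- overlap[order(overlap$bank, overlap$overlap), ]
rownames(overlap) <- NULL
print(overlap)
```

Because the cross-product is symmetric, both directions (a -> b and b -> a) come out, which matches the six-row table asked for in the question.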
I have a 5 column by 100 row data frame. I want to count the number of pipe symbols | occurring in each column.
df <- as.data.frame(matrix(c(
c("1", "2", "3", "4", "5"),
c("A", "B", "C", "B", "B"),
c("|", "W", "G", "|", "D"),
c("Q", "D", "F", "|", "F"),
c("Q", "|", "|", "|", "Q")),
5, 5, byrow = TRUE)
)
V1 V2 V3 V4 V5
1 1 2 3 4 5
2 A B C B B
3 | W G | D
4 Q D F | F
5 Q | | | Q
I'd like a result showing 1 pipe in column 1, 1 pipe in column 2, 1 pipe in column 3, 3 pipes in column 4, and 0 pipes in column 5.
Another way to do it is using colSums() on Dan Y's data frame.
colSums(df == "|")
V1 V2 V3 V4 V5
1 1 1 3 0
If each string is just a single character, you can do a simple sapply:
# turning the example data you provided into a data.frame
df <- as.data.frame(matrix(c(
c("1", "2", "3", "4", "5"),
c("A", "B", "C", "B", "B"),
c("|", "W", "G", "|", "D"),
c("Q", "D", "F", "|", "F"),
c("Q", "|", "|", "|", "Q")),
5, 5, byrow = TRUE)
)
# calculation you want
sapply(df, function(x) sum(x == "|"))
# result = c(1, 1, 1, 3, 0)
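If the cells could hold longer strings containing several pipes, comparing with == would miss them. A possible sketch for that case, using made-up data and a hypothetical helper count_pipes, counts literal matches inside each string:

```r
# Made-up data with multi-character cells.
df <- data.frame(
  a = c("x|y", "none", "|"),
  b = c("a|b|c", "||", "z"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: total number of "|" characters in a column.
# fixed = TRUE treats "|" as a literal character, not as regex alternation.
count_pipes <- function(x) {
  x <- as.character(x)
  sum(lengths(regmatches(x, gregexpr("|", x, fixed = TRUE))))
}

counts <- sapply(df, count_pipes)
print(counts)  # a = 2, b = 4
```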
I have a data frame arranged as follows:
df <- structure(list(name1= c("A","A","B"),
name2 = c("B", "C","C"),
size = c(10,20,30)),.Names=c("name1","name2","size"),
row.names = c("1", "2", "3"), class =("data.frame"))
I would like to add "mirror" observations as follows:
df <- structure(list(name1 = c("A","B","A", "C", "B", "C"),
name2 = c("B", "A","C", "A", "C", "B"),
size = c(10,10,20,20,30,30)),.Names=c("name1","name2","size"),
row.names = c("1", "2", "3", "4", "5", "6"), class =("data.frame"))
Inputs would be much appreciated.
We can do this in two steps,
df1 <- df[rep(rownames(df), each = 2),]
df1[c(FALSE, TRUE), 1:2] <- df1[c(FALSE, TRUE), 2:1]
df1
# name1 name2 size
#1 A B 10
#1.1 B A 10
#2 A C 20
#2.1 C A 20
#3 B C 30
#3.1 C B 30
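We can also use rbindlist, binding by position so that the swapped name columns line up (the mirrored rows come after the originals rather than interleaved):
library(data.table)
rbindlist(list(df, df[c(2:1, 3)]), use.names = FALSE)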
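If the interleaved row order of the expected output matters, a base R sketch along these lines may help (mirror and both are illustrative names): swap the two name columns, stack the copy under the original, then reorder so each mirrored row follows its source row.

```r
df <- data.frame(
  name1 = c("A", "A", "B"),
  name2 = c("B", "C", "C"),
  size = c(10, 20, 30),
  stringsAsFactors = FALSE
)

# Copy with the two name columns swapped, relabelled to the original names.
mirror <- setNames(df[c("name2", "name1", "size")], names(df))

# Stack, then sort by source row index; order() keeps ties in their
# original order, so each original row comes before its mirror.
both <- rbind(df, mirror)
both <- both[order(rep(seq_len(nrow(df)), times = 2)), ]
rownames(both) <- NULL
print(both)
```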