I'm looking for an easy and elegant way to accomplish this.
If I have dataset x, where the relationships are A -> B -> Z -> Y and D -> H -> G, I would like to create dataset y. Unfortunately, the rows are not necessarily in order:
> x <- data.frame(
+ from = as.character(c("A", "E", "B", "D", "H", "Z")),
+ to = as.character(c("B", "E", "Z", "H", "G", "Y")))
>
> y <- data.frame(
+ from = as.character(c("A", "E", "B", "D", "H", "Z")),
+ to = as.character(c("Y", "E", "Y", "G", "G", "Y")))
>
> x
from to
1 A B
2 E E
3 B Z
4 D H
5 H G
6 Z Y
> y
from to
1 A Y
2 E E
3 B Y
4 D G
5 H G
6 Z Y
I have a fairly large dataset (currently 500k rows; will grow in the future) and actually care about the performance; I'm not sure if there are any other ways to do this without a for-loop or even to vectorize/parallelize the process.
I'm thinking about splitting and removing all rows where from == to or creating an indicator to skip certain rows so the loop does not have to go through the entire dataset each time.
I'd also like to know what the stopping condition should be if I do write a loop; I'm not sure how to define when it should stop.
Any suggestions would be appreciated. Thanks!
We can use dplyr to create a grouping variable by comparing adjacent elements of 'to' and 'from', and then change the values in 'to' to the last element of 'to' within each group:
library(dplyr)
x %>%
group_by(grp = cumsum(lag(lead(from, default = last(from)) !=
as.character(to), default = TRUE))) %>%
mutate(to = last(to)) %>%
ungroup %>%
select(-grp)
# A tibble: 4 x 2
# from to
# <fctr> <fctr>
#1 A D
#2 B D
#3 C D
#4 E E
Another solution can be achieved using lag from dplyr and fill from tidyr as:
library(tidyverse)
x %>% arrange(from) %>%
mutate(samegroup = ifelse(from == lag(to), 1, 0)) %>%
mutate(group = ifelse(samegroup == 0 | is.na(samegroup), row_number(), NA)) %>%
fill(group) %>%
group_by(group) %>%
mutate(to = last(to)) %>%
ungroup() %>%
select(-samegroup, -group)
# A tibble: 6 x 2
# from to
# <chr> <chr>
#1 A D
#2 B D
#3 C D
#4 E E
#5 F H
#6 G H
Data used
x <- data.frame(from = as.character(c("A", "B", "F", "C", "G", "E")),
to = as.character(c("B", "C", "G", "D", "H", "E")),
stringsAsFactors = FALSE)
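Both answers above compare neighbouring rows, so they assume the edges are stored in chain order. For the unordered x from the question, one possible sketch is a repeated lookup that follows each 'to' one step further until nothing changes, which also answers the stopping-condition question. The helper name resolve_chain is hypothetical, and the sketch assumes each 'from' value occurs only once; the iteration count is capped at the number of rows so that an unexpected cycle cannot loop forever.

```r
x <- data.frame(
  from = c("A", "E", "B", "D", "H", "Z"),
  to   = c("B", "E", "Z", "H", "G", "Y"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: follow each `to` along the chain until it stops moving.
# Assumes every `from` value is unique.
resolve_chain <- function(from, to) {
  lookup <- setNames(to, from)            # named vector: from -> to
  for (i in seq_along(from)) {            # a chain cannot be longer than n steps
    nxt <- unname(lookup[to])             # NA where `to` never appears in `from`
    nxt[is.na(nxt)] <- to[is.na(nxt)]     # chain ends: keep the current value
    if (identical(nxt, to)) break         # fixed point reached, stop
    to <- nxt
  }
  to
}

y <- transform(x, to = resolve_chain(from, to))
y
```

Each pass is a single vectorised lookup over all rows, so the number of passes depends on the longest chain rather than on the number of rows.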
Related
I am working with a list of edges in R:
data <- structure(list(var1 = c("a", "b", "c", "d", "f", "g", "h"), var2 = c("b",
"c", "a", "e", "g", "h", "i")), class = "data.frame", row.names = c(NA,
-7L))
> data
var1 var2
1 a b
2 b c
3 c a
4 d e
5 f g
6 g h
7 h i
I derived an igraph object from it:
library(igraph)
a <- graph_from_data_frame(data)
> a
IGRAPH 4cd4c06 DN-- 9 7 --
+ attr: name (v/c)
+ edges from 4cd4c06 (vertex names):
[1] a->b b->c c->a d->e f->g g->h h->i
and I have to test whether I have all the combinations between the vertices for every community in my data. I know every community should have nC2 edges, where n is the number of nodes in the community, but I am not sure how to do this with igraph.
In the example above, communities 1 and 2 should be valid, as they have all the combinations between vertices, while community 3 shouldn't.
How do I test this?
As the desired output, ideally I would like to have something like this:
> data2
var1 var2 valid
1 a b TRUE
2 b c TRUE
3 c a TRUE
4 d e TRUE
5 f g FALSE
6 g h FALSE
7 h i FALSE
or anything that would allow me to identify the incomplete pairs.
Thanks!
You can use membership like below
library(dplyr)
data %>%
mutate(grp = membership(components(a))[var1]) %>%
group_by(grp) %>%
mutate(valid = choose(n_distinct(c(var1, var2)), 2) == n()) %>%
ungroup()
which gives
# A tibble: 7 × 4
var1 var2 grp valid
<chr> <chr> <dbl> <lgl>
1 a b 1 TRUE
2 b c 1 TRUE
3 c a 1 TRUE
4 d e 2 TRUE
5 f g 3 FALSE
6 g h 3 FALSE
7 h i 3 FALSE
where grp indicates how the vertices are clustered.
The following does not output the expected format in the question but it does return a FALSE/TRUE vector by community.
data <- structure(list(
var1 = c("a", "b", "c", "d", "f", "g", "h"),
var2 = c("b","c", "a", "e", "g", "h", "i")),
class = "data.frame", row.names = c(NA,-7L))
suppressPackageStartupMessages(
library(igraph)
)
a <- graph_from_data_frame(data)
cmp <- components(a)
tapply(names(cmp$membership), cmp$membership, FUN = \(v) {
cmb <- combn(v, 2, \(w) distances(as.undirected(a), v = w[1], to = w[2]))
all(cmb == 1)
})
#> 1 2 3
#> TRUE TRUE FALSE
Created on 2022-10-13 with reprex v2.0.2
I want to crosstabulate x by y to obtain in the table cells, the values of z.
library(tidyverse)
df <- tibble(x = c("a", "a", "b", "b"),
y = c("c", "d", "c", "d"),
z = c("e", "g", "f", "h"))
# I want to obtain this result:
# c d
# a e g
# b f h
Created on 2021-07-18 by the reprex package (v2.0.0)
I think you want tidyr::pivot_wider...
df %>% pivot_wider(names_from = y, values_from = z)
# A tibble: 2 x 3
x c d
<chr> <chr> <chr>
1 a e g
2 b f h
df <- data.frame(X = c("a", "b", "c", "a", "b", "c", "a", "b", "c", "d" , "a", "b", "c", "d", "e"),
Y = c("w", "w", "w", "K", "K", "K", "L", "L", "L", "L", "Z", "Z", "Z", "Z", "Z"))
Note that the first vector has 5 levels and the second has 4 levels. My goal is to select the rows of df whose level of X occurs with every level of Y. That is, I want to keep the rows with levels "a", "b" and "c", since "d" appears only twice and "e" appears only once.
I tried to make a list of the common levels and keep only the rows with those levels by subsetting. However, it doesn't work, because indexing with the list of levels does not give the positions of the rows I want to keep. Ex:
common <- c ("a", "b", "c")
df2 <- df [c(common),]
In my real df, there are 64 levels in common, so doing it by hand is not feasible. Can someone help me?
I think this is what you want. Essentially splitting X by Y, then looking for all intersecting values that are in every set.
df[df$X %in% Reduce(intersect, split(df$X, df$Y)),]
# X Y
#1 a w
#2 b w
#3 c w
#4 a K
#5 b K
#6 c K
#7 a L
#8 b L
#9 c L
#11 a Z
#12 b Z
#13 c Z
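To see why the one-liner works, it can be unpacked into its intermediate steps. This is only an illustrative sketch; the names by_group, common and kept are introduced here and are not part of the answer, and stringsAsFactors = FALSE is added so the columns are plain character vectors.

```r
df <- data.frame(
  X = c("a", "b", "c", "a", "b", "c", "a", "b", "c", "d", "a", "b", "c", "d", "e"),
  Y = c("w", "w", "w", "K", "K", "K", "L", "L", "L", "L", "Z", "Z", "Z", "Z", "Z"),
  stringsAsFactors = FALSE
)

# Step 1: one vector of X values per level of Y
by_group <- split(df$X, df$Y)

# Step 2: fold intersect() over the groups, keeping only values present in all of them
common <- Reduce(intersect, by_group)

# Step 3: keep the rows whose X is one of the common values
kept <- df[df$X %in% common, ]

common
nrow(kept)
```

Since common is computed from the data, the same code applies unchanged to the 64 common levels in the real data set.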
Another way could be to group_by X and select the groups which have all the distinct values of Y.
library(dplyr)
df %>%
group_by(X) %>%
filter(n_distinct(Y) == n_distinct(.$Y))
# X Y
# <fct> <fct>
# 1 a w
# 2 b w
# 3 c w
# 4 a K
# 5 b K
# 6 c K
# 7 a L
# 8 b L
# 9 c L
#10 a Z
#11 b Z
#12 c Z
In base R, that would be using ave
subset(df, as.logical(ave(as.character(Y), X,
FUN = function(x) length(unique(x)) == length(unique(Y)))))
Using data.table
library(data.table)
setDT(df)[, .SD[uniqueN(Y) == uniqueN(df$Y)], by = X]
I have co-occurrence data that can be represented in two columns. The entries in each column are from the same set of possibilities. Ultimately I am aiming to plot a directed network, but first I would like to split the table into the co-occurrences that are reciprocal (i.e. both X->Y and Y->X) and those that occur in just one direction (i.e. only Y->Z). Here is an example:
library(tidyverse)
# Example data
from <- c("A", "B", "F", "Q", "T", "S", "D", "E", "A", "T", "F")
to <- c("E", "D", "Q", "S", "F", "T", "B", "A", "D", "A", "E")
df <- tibble(from, to)
df
# A tibble: 11 x 2
from to
<chr> <chr>
1 A E
2 B D
3 F Q
4 Q S
5 T F
6 S T
7 D B
8 E A
9 A D
10 T A
11 F E
and here is my desired output:
# Desired output 1 - reciprocal co-occurrences
df %>%
slice(c(1,2)) %>%
rename(item1 = from, item2 = to)
# A tibble: 2 x 2
item1 item2
<chr> <chr>
1 A E
2 B D
# Desired output 2 - single occurrences
df %>%
slice(c(3,4,5,6,9,10,11))
# A tibble: 7 x 2
from to
<chr> <chr>
1 F Q
2 Q S
3 T F
4 S T
5 A D
6 T A
7 F E
If the co-occurrences are reciprocal, it does not matter what order the entries are in; I only need their names. If they are not, I need to know the direction.
This feels like a graph problem, so I have had a go, but I am unfamiliar with working with this type of data and most tutorials seem to cover undirected graphs. Looking at the tidygraph package, which I understand uses the igraph package, I have tried this:
library(tidygraph)
df %>%
as_tbl_graph(directed = TRUE) %>%
activate(edges) %>%
mutate(recip_occur = edge_is_mutual()) %>%
as_tibble() %>%
filter(recip_occur == TRUE)
# A tibble: 4 x 3
from to recip_occur
<int> <int> <lgl>
1 1 8 TRUE
2 2 7 TRUE
3 7 2 TRUE
4 8 1 TRUE
However, this divorces the edges from the nodes and repeats the reciprocal co-occurrences. Does anyone have experience with this sort of data?
Try this:
data:
from <- c("A", "B", "F", "Q", "T", "S", "D", "E", "A", "T", "F")
to <- c("E", "D", "Q", "S", "F", "T", "B", "A", "D", "A", "E")
df <- tibble(from, to)
code:
# Flag each row whose reversed pair (to, from) also occurs somewhere in df
recursive_IND <-
1:nrow(df) %>%
sapply(function(x){
any((as.character(df[x,]) == t(df[,c(2,1)])) %>% {.[1,] & .[2,]})
})
# Reciprocal pairs, keeping one row per unordered pair
df[recursive_IND,][!(df[recursive_IND,] %>% apply(1,sort) %>% t %>% duplicated(.)),]
# One-directional pairs
df[!recursive_IND,]
result:
# A tibble: 2 x 2
# from to
# <chr> <chr>
#1 A E
#2 B D
# A tibble: 7 x 2
# from to
# <chr> <chr>
#1 F Q
#2 Q S
#3 T F
#4 S T
#5 A D
#6 T A
#7 F E
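For larger tables, a vectorised variant avoids the row-by-row sapply. The following is a sketch in base R rather than the answer's own method: it builds a text key for every edge and for its reverse, then checks membership with %in%. The names edges, is_mutual, reciprocal and one_way are introduced here for illustration, and the separator is assumed never to occur inside a node name.

```r
from <- c("A", "B", "F", "Q", "T", "S", "D", "E", "A", "T", "F")
to   <- c("E", "D", "Q", "S", "F", "T", "B", "A", "D", "A", "E")
edges <- data.frame(from, to, stringsAsFactors = FALSE)

sep <- "\r"  # assumption: this character never appears in a node name

# An edge is reciprocal if its forward key also appears among the reversed keys
forward   <- paste(edges$from, edges$to, sep = sep)
backward  <- paste(edges$to, edges$from, sep = sep)
is_mutual <- forward %in% backward

# Reciprocal edges: keep one row per unordered pair
mutual     <- edges[is_mutual, ]
pair_key   <- paste(pmin(mutual$from, mutual$to), pmax(mutual$from, mutual$to), sep = sep)
reciprocal <- mutual[!duplicated(pair_key), ]

# Everything else occurs in one direction only
one_way <- edges[!is_mutual, ]

reciprocal
one_way
```

Because the rows are only filtered, the node names stay attached to the edges, which was the problem with the tidygraph attempt in the question.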
Here is my data:
df <- tibble::tribble(
~A, ~B,
"C", "G",
"D", "H",
"E", "I",
"F", "J")
value1 <- "D"
value2 <- "C"
And, in variable A, I want to replace D and C with "m" and "n", something like this, but it's not working!
df %>% mutate(X = A %>% str_replace_all(c(value1 = "m", value2 = "n")))
My desired output is:
df %>% mutate(X = A %>% str_replace_all(c("D" = "m", "C" = "n")))
But instead of supplying "D" and "C" manually, I want to programmatically supply these, something in line with...using value1 and value2.
How should I do that?
You could try using setNames to set the names of m and n like:
library(dplyr)
library(stringr)
df %>% mutate(X = A %>% str_replace_all(setNames(c("m","n"), c(value1, value2))))
# A tibble: 4 x 3
# A B X
# <chr> <chr> <chr>
#1 C G n
#2 D H m
#3 E I E
#4 F J F
And then checking that it's equal to your desired result:
identical(
df %>% mutate(X = A %>% str_replace_all(c("D" = "m", "C" = "n"))),
df %>% mutate(X = A %>% str_replace_all(setNames(c("m","n"), c(value1, value2)))))
#[1] TRUE
I also included the other packages you use: dplyr and stringr
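The reason the original attempt fails can be checked directly in base R: inside c(), the left-hand side of = is taken as a literal name, so value1 and value2 are never evaluated. A small sketch, with the names literal and programmatic introduced here only for illustration:

```r
value1 <- "D"
value2 <- "C"

# Names are taken literally: the patterns become "value1" and "value2"
literal <- c(value1 = "m", value2 = "n")
names(literal)

# setNames() evaluates its second argument, so the patterns become "D" and "C"
programmatic <- setNames(c("m", "n"), c(value1, value2))
names(programmatic)
```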
You can create a named vector and use it as a replacement lookup.
replacementVector <- c("m","n")
names(replacementVector) <- c("D","C")
Now, use the replacementVector in dplyr chain along with ifelse as:
df %>% mutate(X = ifelse(is.na(replacementVector[A]), A, replacementVector[A]))
# # A tibble: 4 x 3
# A B X
# <chr> <chr> <chr>
# 1 C G n
# 2 D H m
# 3 E I E
# 4 F J F
Data:
library(tidyverse)
df <- tibble::tribble(
~A, ~B,
"C", "G",
"D", "H",
"E", "I",
"F", "J")
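str_replace_all is vectorized over string, pattern and replacement, but separate pattern and replacement vectors are matched against string element by element rather than all being applied to every value. To perform several replacements in each element, put the patterns and replacements in the same named vector:
df %>% mutate(X = A %>% str_replace_all(setNames(c("m","n"), c(value1, value2))))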
We could use chartr
df %>%
mutate(X = chartr('DC', 'mn', A))
# A tibble: 4 x 3
# A B X
# <chr> <chr> <chr>
#1 C G n
#2 D H m
#3 E I E
#4 F J F