Consider two tibbles data and key, given here:
library(tidyverse) # v1.3.2
set.seed(123)
data <- tibble(id = rep(LETTERS[1:10], each = 10),
position = rep(1:10, 10),
zip = sample(letters, 100, replace = T),
zap = sample(letters, 100, replace = T),
zop = sample(letters, 100, replace = T))
# A tibble: 100 × 5
id position zip zap zop
<chr> <int> <chr> <chr> <chr>
1 A 1 l n u
2 A 2 y f h
3 A 3 n y u
4 A 4 c h g
5 A 5 n l t
6 A 6 g z r
7 A 7 c d q
8 A 8 w m a
9 A 9 v n b
10 A 10 z u q
# … with 90 more rows
key <- tibble(id = c("A","D","H"),
start = c(2, 5, 7),
end = c(4, 6, 9))
# A tibble: 3 × 3
id start end
<chr> <dbl> <dbl>
1 A 2 4
2 D 5 6
3 H 7 9
And the desired output:
# A tibble: 8 × 5
id position zip zap zop
<chr> <int> <chr> <chr> <chr>
1 A 2 s u w
2 A 3 n e a
3 A 4 c h h
4 D 5 j j w
5 D 6 m e z
6 H 7 m v h
7 H 8 e q w
8 H 9 v j y
What's the most efficient way to subset data by id and the range of position given by key? I can think of two approaches, but neither is very fast.
1. apply() across rows of key, and bind the pieces
apply(X = key, MARGIN = 1, function(x) {
data |>
dplyr::filter(id == x[1],
position %in% x[2]:x[3])
}
) |> dplyr::bind_rows()
2. pivot and fill key, then join()
key |> tidyr::pivot_longer(cols = c(start, end),
values_to = "position") |>
dplyr::select(id, position) |>
dplyr::group_by(id) |>
tidyr::complete(position = seq(from = min(position),
to = max(position))) |>
dplyr::left_join(data)
What tidy approach would likely be fastest given data with millions of lines and a key with hundreds?
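We can do an inner_join and then slice after grouping by id. Note that slice() selects by row number within each group, so this relies on position matching the row order within each id (as it does in the example data).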
library(dplyr)
inner_join(data, key) %>%
group_by(id) %>%
slice(first(start):first(end)) %>%
ungroup %>%
select(-c(start, end))
-output
# A tibble: 8 × 5
id position zip zap zop
<chr> <int> <chr> <chr> <chr>
1 A 2 s u w
2 A 3 n e a
3 A 4 c h h
4 D 5 j j w
5 D 6 m e z
6 H 7 m v h
7 H 8 e q w
8 H 9 v j y
Another option is to use cur_group() after grouping by 'id' to look up the matching row of 'key'
data %>%
filter(id %in% key$id) %>%
group_by(id) %>%
filter(row_number() >= key$start[match(cur_group()$id, key$id)],
row_number() <= key$end[match(cur_group()$id, key$id)] ) %>%
ungroup
-output
# A tibble: 8 × 5
id position zip zap zop
<chr> <int> <chr> <chr> <chr>
1 A 2 s u w
2 A 3 n e a
3 A 4 c h h
4 D 5 j j w
5 D 6 m e z
6 H 7 m v h
7 H 8 e q w
8 H 9 v j y
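Both pipelines above select rows by their row number within each id, which equals position only because position runs 1, 2, 3, ... in every group of the example data. If that cannot be guaranteed, one possible variant is to join and then filter on position itself. The following is a minimal sketch on made-up data (toy_data and toy_key are hypothetical names, not from the question), and it has not been benchmarked against the methods above.

```r
library(dplyr)

# Hypothetical toy data: positions do NOT start at 1 within each id,
# so selecting by row number would pick the wrong rows.
toy_data <- tibble(id = rep(c("A", "B"), each = 5),
                   position = rep(11:15, 2),
                   value = 1:10)
toy_key <- tibble(id = c("A", "B"),
                  start = c(12, 14),
                  end = c(13, 15))

res <- toy_data %>%
  inner_join(toy_key, by = "id") %>%          # attach start/end to every row of a keyed id
  filter(position >= start, position <= end) %>%  # keep rows inside the range
  select(-start, -end)

res
```

Because the filter compares position with start and end directly, the result does not depend on the row order of the data.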
I did some benchmarking of my methods and the methods provided by akrun. Overall, it seems like the function that uses inner_join is most efficient.
Load libraries and create mock data d_ and key k_
library(tidyverse)
library(microbenchmark)
set.seed(123)
d_ <- tibble(id = rep(LETTERS[1:20], each = 1000),
position = rep(1:1000, 20))
k_ <- tibble(id = LETTERS[1:20],
start = as.double(sample(500,20)),
end = start + 300)
Write different methods as functions
method1 <- function(data, key) {
apply(X = key, MARGIN = 1, function(x) {
data |> dplyr::filter(id == x[1],
position %in% x[2]:x[3])
}
) |> dplyr::bind_rows()
}
method2 <- function(data, key) {
key |> tidyr::pivot_longer(cols = c(start, end),
values_to = "position") |>
dplyr::select(id, position) |>
dplyr::group_by(id) |>
tidyr::complete(position = seq(from = min(position),
to = max(position))) |>
dplyr::left_join(data)
}
method3 <- function(data, key) {
dplyr::inner_join(data, key) |>
group_by(id) |>
dplyr::slice(dplyr::first(start):dplyr::first(end)) |>
dplyr::ungroup() |>
dplyr::select(-c(start, end))
}
method4 <- function(data, key) {
data |>
dplyr::filter(id %in% key$id) |>
dplyr::group_by(id) |>
dplyr::filter(dplyr::row_number() >= key$start[match(dplyr::cur_group()$id,
key$id)],
dplyr::row_number() <= key$end[match(dplyr::cur_group()$id,
key$id)]
) |>
dplyr::ungroup()
}
Evaluate each function 100 times with microbenchmark
mbm <- microbenchmark("acvill 1" = { method1(d_, k_) },
"acvill 2" = { method2(d_, k_) },
"akrun 1" = { method3(d_, k_) },
"akrun 2" = { method4(d_, k_) },
times = 100)
Plot benchmarking results
ggplot(data = tibble(method = mbm$expr, time = mbm$time)) +
geom_violin(mapping = aes(x = method, y = time/10^6, fill = method)) +
ylab("milliseconds") +
theme_classic() +
scale_x_discrete(limits = rev) +
scale_y_continuous(limits = c(0,400),
breaks = seq(0,400,50)) +
theme(axis.title.y = element_blank(),
axis.text = element_text(color = "black", size = 10),
legend.position = "none") +
coord_flip()
I have this sort of data:
df <- data.frame(
id = sample(1:5, 100, replace = TRUE),
dur = sample(c(NA, rnorm(10)), 100, replace = TRUE),
char = sample(LETTERS, 100, replace = TRUE)
)
From this I can compute counts and proportions of the variable char:
library(dplyr)
df %>%
filter(!is.na(dur) & id != lag(id)) %>%
count(char, name = 'freq', sort = TRUE) %>%
mutate(prop = prop.table(freq) * 100)
char freq prop
1 C 6 8.571429
2 M 6 8.571429
3 X 5 7.142857
4 Y 5 7.142857
5 Z 5 7.142857
6 E 4 5.714286
7 I 4 5.714286
8 K 4 5.714286
9 J 3 4.285714
10 Q 3 4.285714
... clipped
Now, in df, the char values also have duration values. So I want to add another column, say mean_dur, with the mean dur values grouped by char in df. Adding on something like group_by(char) etc. to the above code doesn't work as the variable char is no longer recognized. How can that be achieved?
EDIT:
It can be done in steps, like this:
# Step 1 -- make df with counts and proportions:
df1 <- df %>%
filter(!is.na(dur) & id != lag(id)) %>%
count(char, name = 'freq', sort = TRUE) %>%
mutate(prop = prop.table(freq) * 100)
# Step 2 -- make another df with mean dur values:
df2 <- df %>%
filter(!is.na(dur) & id != lag(id)) %>%
group_by(char) %>%
summarise(mean_dur = mean(dur, na.rm = TRUE))
# Step 3 -- transfer mean dur values by matching `char` in `df1` and `df2`
df1$mean_dur <- df2$mean_dur[match(df1$char, df2$char)]
But is there a cleaner and tidier dplyr way?
EDIT 2:
Thanks to @Anoushiravan R's solution, from which I picked up the left_join idea, this seems like a clean and tidy solution (and it does not require the package janitor):
df %>%
filter(!is.na(dur) & id != lag(id)) %>%
count(char, name = 'freq', sort = TRUE) %>%
mutate(prop = prop.table(freq) * 100) %>%
left_join(df %>%
filter(!is.na(dur) & id != lag(id)) %>%
group_by(char) %>%
summarise(mean_dur = mean(dur)), by = "char")
I hope this is what you are looking for:
library(dplyr)
library(janitor)
df %>%
filter(!is.na(dur) & !id == lag(id)) %>%
tabyl(char) %>%
rename(freq = percent) %>%
mutate(freq = freq * 100) %>%
select(-n) %>%
arrange(desc(freq)) %>%
left_join(df %>%
filter(!is.na(dur) & id != lag(id)) %>%
group_by(char) %>%
summarise(mean_dur = mean(dur)), by = "char")
char freq mean_dur
T 7.894737 -0.4861708
Z 7.894737 -0.2867046
A 6.578947 -0.5056797
B 5.263158 0.3513478
E 5.263158 0.5113139
K 5.263158 -1.4560764
L 5.263158 0.8235192
N 5.263158 0.9037481
X 5.263158 -1.4669529
C 3.947368 -0.4064762
I 3.947368 -0.7722133
P 3.947368 -0.1076928
U 3.947368 0.5573875
Y 3.947368 0.2404896
D 2.631579 0.5942473
F 2.631579 1.2381883
G 2.631579 -0.2155605
J 2.631579 1.0528329
M 2.631579 -1.5482806
O 2.631579 0.2813264
S 2.631579 1.2132490
V 2.631579 0.6157874
H 1.315789 -1.2664754
Q 1.315789 1.1027114
R 1.315789 0.1288634
W 1.315789 1.0528329
If you're prepared to give up prop.table, then I think this gives you what you want...
df %>%
filter(!is.na(dur) & id != lag(id)) %>%
group_by(char) %>%
summarise(
n=n(),
prop = 100*n/nrow(.),
mean_dur=mean(dur, na.rm=TRUE),
.groups="drop"
)
# A tibble: 25 x 4
char n prop mean_dur
* <fct> <int> <dbl> <dbl>
1 A 6 8.82 0.158
2 B 5 7.35 -0.144
3 C 2 2.94 0.951
4 D 2 2.94 0.518
5 E 5 7.35 0.211
6 F 3 4.41 0.333
7 G 2 2.94 0.951
8 H 3 4.41 0.624
9 I 2 2.94 -0.422
10 J 2 2.94 -0.347
# … with 15 more rows
[It took me a while to notice you were working with random data. set.seed() would have been helpful! ;=) ]
Edited in line with comment below
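For reference, the single-summarise idea can be made reproducible and kept close to the output layout in the question. This is only a sketch: the seed value is an arbitrary assumption, and the proportion is computed as freq / sum(freq), which should be equivalent to prop.table(freq).

```r
library(dplyr)

set.seed(42)  # arbitrary seed, assumed here only to make the example reproducible
df <- data.frame(
  id   = sample(1:5, 100, replace = TRUE),
  dur  = sample(c(NA, rnorm(10)), 100, replace = TRUE),
  char = sample(LETTERS, 100, replace = TRUE)
)

res <- df %>%
  filter(!is.na(dur) & id != lag(id)) %>%
  group_by(char) %>%
  summarise(freq = n(),              # count per char
            mean_dur = mean(dur),    # mean duration per char
            .groups = "drop") %>%
  mutate(prop = freq / sum(freq) * 100) %>%  # proportion over the filtered rows
  arrange(desc(freq))

head(res)
```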
Another option (note that mean_dur here is computed over all rows of df, before the filter is applied):
mean_dur <- df %>% group_by(char) %>% summarise(mean_dur=mean(dur,na.rm=T))
tab <- df %>%
filter(!is.na(dur) & id != lag(id)) %>%
count(char, name = 'freq') %>%
mutate(prop = prop.table(freq) * 100)
tab <- merge.data.frame(tab,mean_dur)
tab <- tab[order(tab$freq,decreasing = T),]
char freq prop mean_dur
17 R 6 8.108108 -0.75610907
3 D 5 6.756757 -0.61657511
5 F 5 6.756757 -0.34153689
10 K 5 6.756757 -0.90688768
19 T 5 6.756757 0.33628707
6 G 4 5.405405 -0.93390134
9 J 4 5.405405 0.27471673
11 L 4 5.405405 0.87029782
13 N 4 5.405405 0.17163797
16 Q 4 5.405405 -0.67554378
22 X 4 5.405405 -0.42108346
7 H 3 4.054054 0.36290234
14 O 3 4.054054 -0.56712470
15 P 3 4.054054 0.08316665
2 C 2 2.702703 -1.15398142
4 E 2 2.702703 -0.31271923
12 M 2 2.702703 -0.96001502
18 S 2 2.702703 -0.88921047
20 U 2 2.702703 0.24299241
21 W 2 2.702703 -1.32772406
1 A 1 1.351351 0.24299241
8 I 1 1.351351 -1.07336407
23 Z 1 1.351351 -1.07336407
I have a function which returns a tibble. It runs OK, but I want to vectorize it.
library(tidyverse)
tibTest <- tibble(argX = 1:4, argY = 7:4)
square_it <- function(xx, yy) {
if(xx >= 4){
tibble(x = NA, y = NA)
} else if(xx == 3){
tibble(x = as.integer(), y = as.integer())
} else if (xx == 2){
tibble(x = xx^2 - 1, y = yy^2 -1)
} else {
tibble(x = xx^2, y = yy^2)
}
}
It runs OK in a mutate when I call it with map2, giving me the result I wanted:
tibTest %>%
mutate(sq = map2(argX, argY, square_it)) %>%
unnest()
## A tibble: 3 x 4
# argX argY x y
# <int> <int> <dbl> <dbl>
# 1 1 7 1 49
# 2 2 6 3 35
# 3 4 4 NA NA
My first attempt to vectorize it failed, and I can see why - I can't return a vector of tibbles.
square_it2 <- function(xx, yy){
case_when(
xx >= 4 ~ tibble(x = NA, y = NA),
xx == 3 ~ tibble(x = as.integer(), y = as.integer()),
xx == 2 ~ tibble(x = xx^2 - 1, y = yy^2 -1),
TRUE ~ tibble(x = xx^2, y = yy^2)
)
}
# square_it2(4, 2) # FAILS
My next attempt runs OK on a simple input. I can return a list of tibbles, and that's what I want for the unnest
square_it3 <- function(xx, yy){
case_when(
xx >= 4 ~ list(tibble(x = NA, y = NA)),
xx == 3 ~ list(tibble(x = as.integer(), y = as.integer())),
xx == 2 ~ list(tibble(x = xx^2 - 1, y = yy^2 -1)),
TRUE ~ list(tibble(x = xx^2, y = yy^2))
)
}
square_it3(4, 2)
# [[1]]
# # A tibble: 1 x 2
# x y
# <lgl> <lgl>
# 1 NA NA
But when I call it in a mutate, it doesn't give me the result I had with square_it. I can sort of see what's wrong. In the xx == 2 clause, xx acts as an atomic value of 2. But in building the tibble, xx is a length-4 vector.
tibTest %>%
mutate(sq = square_it3(argX, argY)) %>%
unnest()
# # A tibble: 9 x 4
# argX argY x y
# <int> <int> <dbl> <dbl>
# 1 1 7 1 49
# 2 1 7 4 36
# 3 1 7 9 25
# 4 1 7 16 16
# 5 2 6 0 48
# 6 2 6 3 35
# 7 2 6 8 24
# 8 2 6 15 15
# 9 4 4 NA NA
How do I get the same result as I did with square_it, but from a vectorized function using case_when ?
We define row_case_when (together with a helper, case_when2), which has a formula interface similar to that of case_when, except that it takes .data as its first argument, acts by row, and expects the value of each leg to be a data frame. It returns a data.frame/tibble. Wrapping in a list, rowwise and unnest are not needed.
case_when2 <- function (.data, ...) {
fs <- dplyr:::compact_null(rlang:::list2(...))
n <- length(fs)
if (n == 0) {
rlang::abort("No cases provided")
}
query <- vector("list", n)
value <- vector("list", n)
default_env <- rlang::caller_env()
# validate_formula is an unexported dplyr internal and may change between versions
quos_pairs <- purrr::map2(fs, seq_along(fs), dplyr:::validate_formula,
default_env, rlang::current_env())
for (i in seq_len(n)) {
pair <- quos_pairs[[i]]
query[[i]] <- rlang::eval_tidy(pair$lhs, data = .data, env = default_env)
value[[i]] <- rlang::eval_tidy(pair$rhs, data = .data, env = default_env)
if (!is.logical(query[[i]])) {
dplyr:::abort_case_when_logical(pair$lhs, i, query[[i]])
}
if (query[[i]]) return(value[[i]])
}
}
row_case_when <- function(.data, ...) {
.data %>%
group_by(.group = 1:n(), !!!.data) %>%
do(case_when2(., ...)) %>%
mutate %>%
ungroup %>%
select(-.group)
}
Test run
It is used like this:
library(dplyr)
tibTest <- tibble(argX = 1:4, argY = 7:4) # test data from question
tibTest %>%
row_case_when(argX >= 4 ~ tibble(x = NA, y = NA),
argX == 3 ~ tibble(x = as.integer(), y = as.integer()),
argX == 2 ~ tibble(x = argX^2 - 1, y = argY^2 -1),
TRUE ~ tibble(x = argX^2, y = argY^2)
)
giving:
# A tibble: 3 x 4
argX argY x y
<int> <int> <dbl> <dbl>
1 1 7 1 49
2 2 6 3 35
3 4 4 NA NA
mutate_cond and mutate_when
These are not quite the same as row_case_when, since they do not run through the conditions and take the first true one, but with mutually exclusive conditions they can be used for certain aspects of this problem. They do not handle changing the number of rows in the result, but we can use dplyr::filter to remove rows for a particular condition.
mutate_cond, defined in "dplyr mutate/replace several columns on a subset of rows", is like mutate except that the second argument is a condition and the subsequent arguments are applied only to rows for which that condition is TRUE.
mutate_when, defined in the same place, is similar to case_when except that it applies to rows, the replacement values are provided in a list, and the arguments alternate between conditions and lists. Also, all legs are always run, applying the replacement values to the rows satisfying each condition (as opposed to, for each row, performing the replacement on just the first true leg). To get a similar effect to row_case_when, make sure that the conditions are mutually exclusive.
# mutate_cond example
tibTest %>%
filter(argX != 3) %>%
mutate(x = NA_integer_, y = NA_integer_) %>%
mutate_cond(argX == 2, x = argX^2 - 1L, y = argY^2 - 1L) %>%
mutate_cond(argX < 2, x = argX^2, y = argY^2)
# mutate_when example
tibTest %>%
filter(argX != 3) %>%
mutate_when(TRUE, list(x = NA_integer_, y = NA_integer_),
argX == 2, list(x = argX^2 - 1L, y = argY^2 - 1L),
argX < 2, list(x = argX^2, y = argY^2))
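Neither helper is part of dplyr, so the examples above need a definition in scope. The following is a sketch of what a mutate_cond helper along the lines described above could look like; the definition is an assumption written for illustration, not a copy of the referenced code, and toy is a hypothetical data set.

```r
library(dplyr)

# Assumed definition: evaluate `condition` in the data, then apply
# mutate(...) only to the rows where it is TRUE.
mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
  condition <- eval(substitute(condition), .data, envir)
  .data[condition, ] <- .data[condition, ] %>% mutate(...)
  .data
}

# Hypothetical toy data with double columns to keep types consistent
toy <- tibble(argX = c(1, 2, 4), argY = c(7, 6, 4),
              x = NA_real_, y = NA_real_)

res <- toy %>%
  mutate_cond(argX == 2, x = argX^2 - 1, y = argY^2 - 1)

res
```

Only the row with argX == 2 is modified; the other rows keep their NA placeholders.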
You need to ensure you are creating a 1-row tibble with each call of the function, then vectorize that.
This works whether you have rowwise groups or not.
You can do this with switch wrapped in a map2:
Here's a reprex:
library(tidyverse)
tibTest <- tibble(argX = 1:4, argY = 7:4)
square_it <- function(xx, yy) {
map2(xx, yy, function(x, y){
switch(which(c(x >= 4,
x == 3,
x == 2,
x < 4 & x != 3 & x != 2)),
tibble(x = NA, y = NA),
tibble(x = as.integer(), y = as.integer()),
tibble(x = x^2 - 1, y = y^2 -1),
tibble(x = x^2, y = y^2))})
}
tibTest %>% mutate(sq = square_it(argX, argY)) %>% unnest(cols = sq)
#> # A tibble: 3 x 4
#> argX argY x y
#> <int> <int> <dbl> <dbl>
#> 1 1 7 1 49
#> 2 2 6 3 35
#> 3 4 4 NA NA
Created on 2020-05-16 by the reprex package (v0.3.0)
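A related way to get element-by-element behaviour, without rewriting the branches, is to wrap the original scalar function with base R's Vectorize(). This is a sketch, not one of the answers above: square_it_v is a hypothetical name, and the wrapper still calls the function once per row, so it is a convenience rather than a speed-up.

```r
library(dplyr)
library(tidyr)

tibTest <- tibble(argX = 1:4, argY = 7:4)

# Scalar function from the question, unchanged
square_it <- function(xx, yy) {
  if (xx >= 4) {
    tibble(x = NA, y = NA)
  } else if (xx == 3) {
    tibble(x = as.integer(), y = as.integer())
  } else if (xx == 2) {
    tibble(x = xx^2 - 1, y = yy^2 - 1)
  } else {
    tibble(x = xx^2, y = yy^2)
  }
}

# SIMPLIFY = FALSE makes the wrapper return a list with one tibble per element
square_it_v <- Vectorize(square_it, SIMPLIFY = FALSE)

res <- tibTest %>%
  mutate(sq = square_it_v(argX, argY)) %>%
  unnest(cols = sq)   # the empty tibble for argX == 3 drops that row

res
```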
I want to aggregate a data.frame with two columns: in one column I have "num", which is an identifier number and in the other I have text. It is important that the aggregated text has a space between the individual parts. My code is this:
data_aggr <- aggregate(
x = data_aggr,
FUN = paste,
by = list(data_aggr$num)
)
I have tried the obvious with FUN = paste(collapse = " ") and
FUN = paste,
collapse = " ",
but that doesn't work. How do I need to do this?
Aggregate can be used to paste together the rows with the same value of num as follows:
data_aggr <- data.frame(num=c(1,1,1,2,2), letters=letters[1:5])
aggregate(data_aggr$letters, list(data_aggr$num), FUN=paste, collapse= " ")
# Group.1 x
# 1 1 a b c
# 2 2 d e
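The key point is that extra arguments after FUN are passed on to the function, whereas FUN = paste(collapse = " ") passes the result of calling paste rather than the function itself. As a sketch, the same call can also be written with the formula interface of aggregate, which keeps the original column names:

```r
data_aggr <- data.frame(num = c(1, 1, 1, 2, 2), letters = letters[1:5])

# `collapse = " "` is forwarded to paste() through aggregate's `...`
res <- aggregate(letters ~ num, data = data_aggr, FUN = paste, collapse = " ")

res
```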
A dplyr solution: the idea is to create a new column with the row number, so that the operation can be carried out on each row.
library(dplyr)
df.ask <- data.frame('Num' = 1:10,
  'Text' = letters[1:10])

df.ask %>%
  mutate(row_num = row_number()) %>%
  group_by(row_num) %>%
  mutate(together = paste(Num, Text, collapse = ' ')) %>%
  ungroup() %>%
  select(-row_num)
# A tibble: 10 x 3
Num Text together
<int> <fct> <chr>
1 1 a 1 a
2 2 b 2 b
3 3 c 3 c
4 4 d 4 d
5 5 e 5 e
6 6 f 6 f
7 7 g 7 g
8 8 h 8 h
9 9 i 9 i
10 10 j 10 j
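If the goal is one row per identifier, as in the question, a grouped summarise may be closer to what is wanted. The sketch below uses a hypothetical column name text (rather than the question's actual column names, which are not shown) and collapses all text parts that share the same num:

```r
library(dplyr)

# Hypothetical data: `num` is the identifier, `text` holds the parts to join
data_aggr <- data.frame(num = c(1, 1, 1, 2, 2),
                        text = c("a", "b", "c", "d", "e"))

res <- data_aggr %>%
  group_by(num) %>%
  summarise(text = paste(text, collapse = " "),  # one space between parts
            .groups = "drop")

res
```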
I work with R, and I have a table xy like this
View( xy)
X Y
21 A
33 B
24 B
16 A
25 B
31 A
17 B
14 A
Now I want to bin X in steps of 10 and count the frequency of each Y value per bin, so that I end up with something like this:
Class A B
I (1-10) 0 0
II (11-20) 2 1
III (21-30) 1 2
And so on
First create the labels using either the commented out hard coded labels or the computed labels lab. Then use cut and table to create the resulting table.
# lab <- c("I (1-10)", "II (11-20)", "III (21-30)", "IV (31-40)")
n <- ceiling(max(DF$X) / 10) # 4
bounds <- seq(0, 10*n, 10) # c(0, 10, 20, 30, 40)
lab <- sprintf("%s (%d-%d)", as.roman(1:n), head(bounds, -1) + 1, bounds[-1])
Class <- cut(DF$X, bounds, labels = lab)
table(Class, Y = DF$Y)
giving:
Y
Class A B
I (1-10) 0 0
II (11-20) 2 1
III (21-30) 1 2
IV (31-40) 1 1
Note
We assume the input data frame DF is the following shown in reproducible form:
Lines <- "
X Y
21 A
33 B
24 B
16 A
25 B
31 A
17 B
14 A"
DF <- read.table(text = Lines, header = TRUE)
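The sample data contains no values that fall exactly on a class boundary, so it may help to check how cut treats them. By default cut uses right-closed intervals, which matches classes such as 1-10 and 11-20. The short sketch below uses made-up boundary values (x is hypothetical, not from the question):

```r
bounds <- seq(0, 30, 10)                       # c(0, 10, 20, 30)
lab <- c("I (1-10)", "II (11-20)", "III (21-30)")

x <- c(1, 10, 11, 20, 21, 30)                  # made-up boundary values
res <- cut(x, bounds, labels = lab)            # intervals (0,10], (10,20], (20,30]

as.character(res)
# 10 falls in "I (1-10)", 20 in "II (11-20)", 30 in "III (21-30)"

# A value of 0 lies outside (0,10] and becomes NA
zero_class <- cut(0, bounds, labels = lab)
```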
One tidyverse possibility could be:
df %>%
mutate(Class = (X - 1) %/% 10) %>%
count(Y, Class) %>%
group_by(Y) %>%
complete(Class = seq(0, max(Class), 1)) %>%
spread(Y, n, fill = 0)
Class A B
<dbl> <dbl> <dbl>
1 0 0 0
2 1 2 1
3 2 1 2
4 3 1 1
Or if you want also the ranges:
df %>%
mutate(Class = (X - 1) %/% 10) %>%
count(Y, Class) %>%
group_by(Y) %>%
complete(Class = seq(0, max(Class), 1)) %>%
spread(Y, n, fill = 0) %>%
mutate(Class = paste(Class * 10 + 1,
lead(Class * 10, default = ((last(Class) + 1) * 10)),
sep = "-"))
Class A B
<chr> <dbl> <dbl>
1 1-10 0 0
2 11-20 2 1
3 21-30 1 2
4 31-40 1 1
Or if you want the exact output you provided:
df %>%
mutate(Class = (X - 1) %/% 10) %>%
count(Y, Class) %>%
group_by(Y) %>%
complete(Class = seq(0, max(Class), 1)) %>%
spread(Y, n, fill = 0) %>%
mutate(Class = paste0("(",
Class * 10 + 1,
"-",
lead(Class * 10, default = ((last(Class) + 1) * 10)),
")"),
Class = paste(as.roman(row_number()), Class, sep = " "))
Class A B
<chr> <dbl> <dbl>
1 I (1-10) 0 0
2 II (11-20) 2 1
3 III (21-30) 1 2
4 IV (31-40) 1 1
Or a possibility for the cases when X == 0:
df %>%
filter(X > 0) %>%
mutate(Class = (X - 1) %/% 10) %>%
count(Y, Class) %>%
group_by(Y) %>%
complete(Class = seq(0, max(Class), 1)) %>%
spread(Y, n, fill = 0) %>%
mutate(Class = paste0("(",
Class * 10 + 1,
"-",
lead(Class * 10, default = ((last(Class) + 1) * 10)),
")"),
Class = paste(as.roman(row_number()), Class, sep = " "))
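The pipelines above derive the class index from integer division by 10. How values on a class boundary are assigned depends on the exact expression, and the sample data (which has no multiples of 10) does not exercise that case. A small sketch with made-up values shows the difference between dividing X directly and subtracting 1 first:

```r
x <- c(1, 10, 11, 20, 21)      # made-up values, including class boundaries

plain   <- x %/% 10            # 0 1 1 2 2: 10 and 20 move up to the next class
shifted <- (x - 1) %/% 10      # 0 0 1 1 2: 10 stays in 1-10, 20 stays in 11-20

data.frame(x, plain, shifted)
```

For classes labelled 1-10, 11-20 and so on, the shifted form is the one that agrees with the labels.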
I'm facing the following issue. I have some data like this:
library(tidyverse)
library(tidygraph)
library(ggraph)
library(ggrepel)
edges <- data.frame(a=c('k','k','k','k','k','z','z'),
b=c('b','b','b','b','c','b','c'), costant = 1)
a b costant
1 k b 1
2 k b 1
3 k b 1
4 k b 1
5 k c 1
6 z b 1
7 z c 1
Now I would like to draw a graph with ggraph that has nodes and weighted edges. So I worked this way:
# first I calculated the edges weights
edges1 <- edges%>% group_by(a,b) %>% summarise(weight = sum(costant))
> edges1
# A tibble: 4 x 3
# Groups: a [?]
a b weight
<fct> <fct> <dbl>
1 k b 4
2 k c 1
3 z b 1
4 z c 1
Then the nodes:
nodes <- rbind(data.frame(word = edges$a, n = 1),data.frame(word = edges$b, n = 1)) %>%
group_by(word) %>%
summarise(n = sum(n))
> nodes
# A tibble: 4 x 2
word n
<fct> <dbl>
1 k 5
2 z 2
3 b 5
4 c 2
So far, everything works fine. Next, following this as an example:
tidy <- tbl_graph(nodes = nodes, edges = edges1, directed = T)
tidy <- tidy %>%
activate(edges) %>%
arrange(desc(weight)
)
Then I plotted the graph:
ggraph(tidy, layout = "gem") +
geom_node_point(aes(size=n)) +
geom_edge_link(aes(width = weight), alpha = 0.8) +
scale_edge_width(range = c(0.2, 2)) +
geom_text_repel(aes(x = x, y=y , label=word))
But the result (plot not shown here) contains an unexpected edge.
I cannot figure out why there is a line between k and z, because that edge does not exist.
Thanks in advance.
This seems to happen because tbl_graph converts the factor node columns of the edges1 tibble to integers with as.integer, without taking the nodes tibble into account; that is the source of the error. If we first convert the edge endpoints to the correct integer indices into nodes, it works as expected.
edges <- data.frame(a=c('k','k','k','k','k','z','z'),
b=c('b','b','b','b','c','b','c'), costant = 1)
edges1 <- edges%>% group_by(a,b) %>% summarise(weight = sum(costant))
nodes <- rbind(data.frame(word = edges$a, n = 1),data.frame(word = edges$b, n = 1)) %>%
group_by(word) %>%
summarise(n = sum(n))
edges2 <- edges1 # save edges with factor node labels into edges2
# convert 'from' and 'to' factor columns to integer columns correctly
# with the nodes tibble's corresponding matched index values
edges1$a <- match(edges1$a, nodes$word)
edges1$b <- match(edges1$b, nodes$word)
tidy <- tbl_graph(nodes = nodes, edges = edges1, directed = T)
tidy <- tidy %>%
activate(edges) %>%
arrange(desc(weight)
)
ggraph(tidy, layout = "gem") +
geom_node_point(aes(size=n)) +
geom_edge_link(aes(width = weight), arrow = arrow(length = unit(4, 'mm')), end_cap = circle(3, 'mm'), alpha = 0.8) +
scale_edge_width(range = c(0.2, 2)) +
geom_text_repel(aes(x = x, y=y , label=word))
edges2 # compare the edges in the following tibble with the next figure
# A tibble: 4 x 3
# Groups: a [?]
a b weight
<fct> <fct> <dbl>
#1 k b 4
#2 k c 1
#3 z b 1
#4 z c 1
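The effect can be reproduced without any graph package. The sketch below assumes, as described above, that factor endpoints end up as their integer level codes; nodes_word, from and to are hypothetical stand-ins for nodes$word, edges1$a and edges1$b.

```r
nodes_word <- c("k", "z", "b", "c")       # node order as in the nodes tibble

from <- factor(c("k", "k", "z", "z"))     # levels: "k", "z"
to   <- factor(c("b", "c", "b", "c"))     # levels: "b", "c"

# Level codes ignore the node table: "b" -> 1 and "c" -> 2,
# which are the positions of "k" and "z" in nodes_word
codes_to <- as.integer(to)                # 1 2 1 2

# match() looks each label up in the node table instead
idx_from <- match(from, nodes_word)       # 1 1 2 2
idx_to   <- match(to, nodes_word)         # 3 4 3 4

data.frame(from, to, codes_to, idx_from, idx_to)
```

With level codes, the edge k to c would be read as node 1 to node 2, that is k to z, which would explain the unexpected line in the plot.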