How to find the similarity in R?

I have a data set as I've shown below:
It shows which book is sold by which shop.
df <- tribble(
~shop, ~book_id,
"A", 1,
"B", 1,
"C", 2,
"D", 3,
"E", 3,
"A", 3,
"B", 4,
"C", 5,
"D", 1,
)
In the data set,
shop A sells 1, 3
shop B sells 1, 4
shop C sells 2, 5
shop D sells 3, 1
shop E sells only 3
So now, I want to calculate the Jaccard index here. For instance, let's take shop A and shop B. There are three different books that are sold by A and B (book 1, book 3, book 4). However, only one product is sold by both shops (this is product 1). So, the Jaccard index here should be 33.3% (1/3).
Here is the sample of the desired data:
df <- tribble(
~shop_1, ~shop_2, ~similarity,
"A", "B", 33.3,
"B", "A", 33.3,
"A", "C", 0,
"C", "A", 0,
"A", "D", 100,
"D", "A", 100,
"A", "E", 50,
"E", "A", 50,
)
Any comments/assistance really appreciated! Thanks in advance.

I don't know about a package but you can write your own function. I guess by similarity you mean something like this:
similarity <- function(x, y) {
k <- length(intersect(x, y))
n <- length(union(x, y))
k / n
}
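Then you can use tidyr::crossing to pair the grouped data frame with itself and apply the function to every pair of shops. The result is a proportion between 0 and 1; multiply by 100 if you want a percentage.
library(tidyverse)
# one row per shop, with its books collected in a list column
dfg <- df %>% group_by(shop) %>% summarise(books = list(book_id))
# all ordered pairs of shops, excluding a shop paired with itself
crossing(dfg %>% set_names(paste0, "_A"), dfg %>% set_names(paste0, "_B")) %>%
filter(shop_A != shop_B) %>%
mutate(similarity = map2_dbl(books_A, books_B, similarity))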
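For comparison, here is a minimal base R sketch of the same idea that needs no packages and reports the Jaccard index as a percentage, as in the desired output. The names jaccard, books and pairs are illustrative, not taken from the answer above, and the data frame is rebuilt with data.frame() so the block runs on its own.

```r
# Same data as in the question, rebuilt without tribble()
df <- data.frame(
  shop    = c("A", "B", "C", "D", "E", "A", "B", "C", "D"),
  book_id = c(1, 1, 2, 3, 3, 3, 4, 5, 1)
)

# Named list: one vector of book ids per shop
books <- split(df$book_id, df$shop)

# Jaccard index: size of the intersection over size of the union
jaccard <- function(x, y) length(intersect(x, y)) / length(union(x, y))

# All ordered pairs of distinct shops
pairs <- expand.grid(shop_1 = names(books), shop_2 = names(books),
                     stringsAsFactors = FALSE)
pairs <- pairs[pairs$shop_1 != pairs$shop_2, ]

# Similarity in percent, rounded to one decimal place
pairs$similarity <- round(100 * mapply(
  function(a, b) jaccard(books[[a]], books[[b]]),
  pairs$shop_1, pairs$shop_2, USE.NAMES = FALSE
), 1)

print(pairs)
```

Each unordered pair appears twice (A/B and B/A), which matches the layout of the desired data in the question.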

Related

Tidyverse: group_by, arrange, and lag across columns

I am working on a projection model for sports where, for a given team's most recent game, I need to know:
Who is their next opponent? (solved)
When is the last time their next opponent played?
A reprex is given below. Using row 1 as an example, I would need to find that the most recent game of "e" (the next opponent of "a") was game_id_ 3.
game_id_ <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6)
game_date_ <- c(rep("2021-01-29", 6), rep("2021-01-30", 6))
team_ <- c("a", "b", "c", "d", "e", "f", "b", "c", "d", "f", "e", "a")
opp_ <- c("b", "a", "d", "c", "f", "e", "c", "b", "f", "d", "a", "e")
df <- data.frame(game_id_, game_date_, team_, opp_)
#Next opponent
df <- df %>%
arrange(game_date_, game_id_, team_) %>%
group_by(team_) %>%
mutate(next_opp = lead(opp_, n = 1L))
If I can provide more details, please let me know.
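We can use match to return the corresponding game_id_. Note that match returns the position of the first match only, so this picks the earliest row in which next_opp appears in opp_; for the example data that is the expected game.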
library(dplyr)
df %>%
arrange(game_date_, game_id_, team_) %>%
group_by(team_) %>%
mutate(next_opp = lead(opp_, n = 1L)) %>%
ungroup %>%
mutate(last_time = game_id_[match(next_opp, opp_)])
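If teams play more than two games, the first match may not be the most recent one. The following base R sketch is one possible way to make the lookup explicit. It assumes that "most recent" means the next opponent's latest game before the upcoming matchup; the helper columns next_game and opp_last_game are hypothetical names introduced here.

```r
game_id_ <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6)
team_    <- c("a", "b", "c", "d", "e", "f", "b", "c", "d", "f", "e", "a")
opp_     <- c("b", "a", "d", "c", "f", "e", "c", "b", "f", "d", "a", "e")
df <- data.frame(game_id_, team_, opp_)
df <- df[order(df$game_id_, df$team_), ]

# Position of each team's next game (NA when there is none)
next_row <- sapply(seq_len(nrow(df)), function(i) {
  later <- which(df$team_ == df$team_[i] & df$game_id_ > df$game_id_[i])
  if (length(later)) later[1] else NA_integer_
})
df$next_opp  <- df$opp_[next_row]
df$next_game <- df$game_id_[next_row]

# Latest game the next opponent plays before that matchup
df$opp_last_game <- sapply(seq_len(nrow(df)), function(i) {
  if (is.na(next_row[i])) return(NA_real_)
  prev <- df$game_id_[df$team_ == df$next_opp[i] &
                        df$game_id_ < df$next_game[i]]
  if (length(prev)) max(prev) else NA_real_
})

print(df)
```

The row-wise sapply calls are quadratic in the number of rows, which is fine for a sketch but may be slow on a full season of data.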

Aggregate the data in R

I have a data set that is shown below:
library(tidyverse)
data <- tribble(
~category, ~product_id,
"A", 10,
"B", 20,
"C", 30,
"A", 10,
"A", 10,
"B", 20,
"C", 30,
"A", 10,
"A", 10,
"B", 20,
)
And now, I want to group it by the "category" variable, keep the "product_id" and add a new variable that counts the categories:
aggregated_data <- tribble(
~category, ~product_id, ~numberOfcategory,
"A", 10, 5,
"B", 20, 3,
"C", 30, 2,
)
I already got the "numberOfcategory" with this code:
data %>%
group_by(category) %>%
tally(sort=TRUE)
But somehow I could not keep the product_id.
Could someone help me to get the dataframe (aggregated_data)? Thanks in advance.
You were close! Just also group by product_id as follows:
data %>%
group_by(category,product_id) %>%
tally(sort=TRUE)
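As a possible shorthand, dplyr::count() combines the grouping and the tally in one call, and its name argument sets the name of the count column. This is a sketch rather than part of the answer above; the data is rebuilt with data.frame() so that only dplyr is required.

```r
library(dplyr)

data <- data.frame(
  category   = c("A", "B", "C", "A", "A", "B", "C", "A", "A", "B"),
  product_id = c(10, 20, 30, 10, 10, 20, 30, 10, 10, 20)
)

# count() groups by the listed columns and counts the rows in each group;
# name = renames the default "n" column
aggregated_data <- data %>%
  count(category, product_id, sort = TRUE, name = "numberOfcategory")

print(aggregated_data)
```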

Most elegant way to convert lists into igraph object for plotting

I am new to igraph and it seems to be a very powerful (and therefore also complex) package.
I tried to convert the following lists into an igraph object.
graph <- list(s = c("a", "b"),
a = c("s", "b", "c", "d"),
b = c("s", "a", "c", "d"),
c = c("a", "b", "d", "e", "f"),
d = c("a", "b", "c", "e", "f"),
e = c("c", "d", "f", "z"),
f = c("c", "d", "e", "z"),
z = c("e", "f"))
weights <- list(s = c(3, 5),
a = c(3, 1, 10, 11),
b = c(5, 3, 2, 3),
c = c(10, 2, 3, 7, 12),
d = c(15, 7, 2, 11, 2),
e = c(7, 11, 3, 2),
f = c(12, 2, 3, 2),
z = c(2, 2))
Interpretation is as follows: s is the starting node, it links to nodes a and b. The edges are weighted 3 for s to a and 5 for s to b and so on.
I tried all kinds of functions from igraph but only got all kinds of errors. What is the most elegant and easy way to convert the above into an igraph object for plotting the graph?
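Create an edge list and then a graph from that. Assign the weights and plot it. stack() returns the neighbours in values and the list names in ind, so the columns are reordered to make each edge run from the list name to its neighbour, in the same order as the weights.
library(igraph)
set.seed(123) # for a reproducible plot layout
e <- as.matrix(stack(graph)[, c("ind", "values")])
g <- graph_from_edgelist(e)
E(g)$weight <- stack(weights)[[1]]
plot(g, edge.label = E(g)$weight)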
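Another way to keep edges and weights aligned is to put them in one data frame first. The sketch below builds a from/to/weight table in base R and, if igraph is installed, passes it to graph_from_data_frame(), which treats the first two columns as the edge endpoints and the remaining columns as edge attributes. The name edges is illustrative.

```r
graph <- list(s = c("a", "b"),
              a = c("s", "b", "c", "d"),
              b = c("s", "a", "c", "d"),
              c = c("a", "b", "d", "e", "f"),
              d = c("a", "b", "c", "e", "f"),
              e = c("c", "d", "f", "z"),
              f = c("c", "d", "e", "z"),
              z = c("e", "f"))
weights <- list(s = c(3, 5),
                a = c(3, 1, 10, 11),
                b = c(5, 3, 2, 3),
                c = c(10, 2, 3, 7, 12),
                d = c(15, 7, 2, 11, 2),
                e = c(7, 11, 3, 2),
                f = c(12, 2, 3, 2),
                z = c(2, 2))

# Every node must have exactly one weight per neighbour
stopifnot(identical(lengths(graph), lengths(weights)))

# One row per directed edge: list name -> neighbour, with its weight
edges <- data.frame(
  from   = rep(names(graph), lengths(graph)),
  to     = unlist(graph, use.names = FALSE),
  weight = unlist(weights[names(graph)], use.names = FALSE)
)

# Only build and plot the graph when igraph is available
if (requireNamespace("igraph", quietly = TRUE)) {
  g <- igraph::graph_from_data_frame(edges, directed = TRUE)
  plot(g, edge.label = igraph::E(g)$weight)
}
```

Inspecting edges before building the graph makes it easy to check that, for example, the edge from d to a carries weight 15 while the edge from a to d carries 11.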

Kruskal-Wallis test: create lapply function to subset data.frame?

I have a data set of values (val) grouped by multiple categories (distance & phase). I would like to run a Kruskal-Wallis test within each category, where val is the dependent variable, distance is a factor, and phase splits my data into 3 groups.
As such, I need to specify the subset of the data within the Kruskal-Wallis test and then apply the test to each of the groups. BUT, I cannot get my subsetting to work!
In the R help, it is specified that subset is an optional vector specifying a subset of observations to be used. But how do I correctly pass this in my lapply call?
My dummy data:
# create data
val<-runif(60, min = 0, max = 100)
distance<-floor(runif(60, min=1, max=3))
phase<-rep(c("a", "b", "c"), 20)
df<-data.frame(val, distance, phase)
# get unique groups
ii<-unique(df$phase)
# get basic statistics per group
aggregate(val ~ distance + phase, df, mean)
# run Kruskal test, specify the subset
kruskal.test(df$val ~df$distance,
subset = phase == "c")
This works well, so my subset should be correctly set as a vector.
But how to use this in a lapply function?
# DOES not work!!
lapply(ii, kruskal.test(df$val ~ df$distance,
subset = df$phase == as.character(ii)))
My overall goal is to create a function from kruskal.test, and save all statistics for each group into one table.
All help is highly appreciated.
Usually you would start by splitting, and then lapplying.
Something like
lapply(split(df, df$phase), function(d) { kruskal.test(val ~ distance, data=d) })
would yield a list, indexed by the phase, of the results of kruskal.test.
Your final expression does not work because lapply expects a function as its second argument, but kruskal.test(...) is evaluated immediately and yields the result of the test rather than a function. If you wrap it in a function definition that takes the index as its argument, it works; it is just a little less idiomatic.
lapply(ii, function(i) { kruskal.test(df$val ~ df$distance, subset=df$phase==i )})
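To get from the list of results to a single table, one option is to pull the relevant elements out of each htest object and bind the rows together. This is a sketch under some assumptions: the seed is arbitrary, distance is set to alternate between 1 and 2 so that every phase contains both levels, and kw_by_phase is an illustrative name.

```r
set.seed(1)  # assumed seed, only to make the sketch reproducible
df <- data.frame(
  val      = runif(60, min = 0, max = 100),
  distance = rep(c(1, 2), 30),
  phase    = rep(c("a", "b", "c"), 20)
)

# Run the test per phase and keep one row of statistics for each
kw_by_phase <- do.call(rbind, lapply(split(df, df$phase), function(d) {
  kt <- kruskal.test(val ~ distance, data = d)
  data.frame(
    phase     = d$phase[1],
    statistic = unname(kt$statistic),  # Kruskal-Wallis chi-squared
    df        = unname(kt$parameter),  # degrees of freedom
    p.value   = kt$p.value
  )
}))
rownames(kw_by_phase) <- NULL

print(kw_by_phase)
```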
Though it is late, this might help someone with the same problem, so here is an answer using the tidyverse and rstatix packages. The rstatix package "provides a simple and intuitive pipe friendly framework, coherent with the 'tidyverse' design philosophy for performing basic statistical tests".
library(rstatix)
library(tidyverse)
df %>%
group_by(phase) %>%
kruskal_test(val ~ distance)
Output
# A tibble: 3 x 7
  phase .y.       n statistic    df     p method
* <chr> <chr> <int>     <dbl> <int> <dbl> <chr>
1 a     val      20    0.230      1 0.631 Kruskal-Wallis
2 b     val      20    0.0229     1 0.88  Kruskal-Wallis
3 c     val      20    0.322      1 0.570 Kruskal-Wallis
which is the same as the result from @user295691's answer.
Data
df = structure(list(val = c(93.8056977232918, 31.0681172646582, 40.5262873973697,
47.6368983509019, 65.23181500379, 64.4571609096602, 10.3301600087434,
90.4661140637472, 41.2359046051279, 28.3357713604346, 49.8977075796574,
10.8744730940089, 5.31001624185592, 71.9248640118167, 99.0267782937735,
73.7928744405508, 3.31214582547545, 40.2693636715412, 27.6980920461938,
79.501334275119, 60.5167196830735, 89.9171086261049, 87.4633299885318,
43.1893823202699, 91.1248738644645, 99.755659350194, 7.25280269980431,
96.957387868315, 75.0860505970195, 52.3794749286026, 26.6221587313339,
52.5518182432279, 24.1361060412601, 49.5364486705512, 65.5214034719393,
38.9469220302999, 0.687191751785576, 19.3090825574473, 19.6511475136504,
25.5966754630208, 7.33999472577125, 33.9820940745994, 50.3751677693799,
10.811762069352, 17.2359711956233, 53.958406439051, 64.2723652534187,
92.7404976682737, 26.824192632921, 30.0975760444999, 52.0105463219807,
74.4495407678187, 56.0636054025963, 91.891074879095, 14.0827904455364,
59.3607738381252, 66.5170294465497, 24.1726311156526, 83.0881901318207,
35.5380675755441), distance = c(2, 1, 1, 1, 1, 2, 1, 2, 2, 1,
2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1,
1, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 2, 2, 1,
1, 2, 1, 1, 2, 2, 2, 2), phase = c("a", "b", "c", "a", "b", "c",
"a", "b", "c", "a", "b", "c", "a", "b", "c", "a", "b", "c", "a",
"b", "c", "a", "b", "c", "a", "b", "c", "a", "b", "c", "a", "b",
"c", "a", "b", "c", "a", "b", "c", "a", "b", "c", "a", "b", "c",
"a", "b", "c", "a", "b", "c", "a", "b", "c", "a", "b", "c", "a",
"b", "c")), class = "data.frame", row.names = c(NA, -60L))

How to group data, with restrictions on group size in R

Given a data frame, I can group the rows under a stated property, count them to know the size of the group and assign them uniquely with an id number. But what I really need is to do this process so that the group sizes are restricted under the following three conditions:
If size modulo 3 = 0, then split into smaller groups all of size 3,
If size modulo 3 = 1, then split into smaller groups of size 3 and two groups of size 2.
If size modulo 3 = 2, then split into smaller groups of size 3 and one of size 2
Hence if size is 4 then create two groups, both of size 2; whereas when size is 5, then split into two groups of size 3 and 2.
I have created the following minimal example.
This is the starting data. Typically, it would not be ordered and could have more columns:
structure(
list(property = c("A", "B", "B", "C", "C", "C", "D", "D", "D", "D", "E", "E", "E", "E", "E", "F", "F", "F", "F", "F", "F", "G", "G", "G", "G", "G", "G", "G")),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -28L),
.Names = "property"
)
The desired output would be:
structure(
list(property = c("A", "B", "B", "C", "C", "C", "D", "D", "D", "D", "E", "E", "E", "E", "E", "F", "F", "F", "F", "F", "F", "G", "G", "G", "G", "G", "G", "G"),
id = c(1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6, 6, 7, 7, 8, 8, 8, 9, 9, 9, 10, 10, 10, 11, 11, 12, 12)),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -28L),
.Names = c("property", "id")
)
The order of the groups is not important.
I first create a function that assigns a group index to each element according to your requirements. It fills groups of three in order and truncates the index vector to the length of the input, so the last group may be smaller. In the special case where the last group would have length one, the second-to-last element is moved into the last group, which gives two groups of size two and satisfies your condition 2:
create_grp_idx <- function(x) {
n <- length(x)
m <- n %/% 3 + 1
idx <- rep(1:m, each = 3)[1:n]
if (n %% 3 == 1 && n > 1) idx[n-1] <- idx[n]
return (idx)
}
Now I use dplyr to group the data by property and then apply create_grp_idx() to each group, thus creating the index n. I then use interaction() to get a factor from each combination of property and the newly created index n. Since you use numbers in your example, I convert the factor to numeric and finally remove the column with the index n.
library(dplyr)
group_by(data, property) %>%
mutate(n = create_grp_idx(property)) %>%
ungroup %>%
mutate(id = as.numeric(interaction(property, n))) %>%
select(-n)
## Source: local data frame [28 x 2]
##
##    property    id
##       (chr) (dbl)
## 1         A     1
## 2         B     2
## 3         B     2
## 4         C     3
## 5         C     3
## 6         C     3
## 7         D     4
## 8         D     4
## 9         D    11
## 10        D    11
## ..      ...   ...
This does not give exactly the example output you gave, but since you said that the order of the groups is irrelevant, I assume that this is the result that you want.
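If consecutive ids like those in the desired output are wanted, one possible variation is to number the property/index combinations in order of first appearance instead of using interaction(). The sketch below uses base R only; key and id are illustrative names, and create_grp_idx() is repeated so the block runs on its own.

```r
create_grp_idx <- function(x) {
  n <- length(x)
  m <- n %/% 3 + 1
  idx <- rep(1:m, each = 3)[1:n]
  # a trailing group of size one borrows the previous element
  if (n %% 3 == 1 && n > 1) idx[n - 1] <- idx[n]
  idx
}

property <- c("A", "B", "B", "C", "C", "C", "D", "D", "D", "D",
              "E", "E", "E", "E", "E", "F", "F", "F", "F", "F", "F",
              "G", "G", "G", "G", "G", "G", "G")

# Within-property group index, computed separately for each property
n <- ave(seq_along(property), property, FUN = create_grp_idx)

# Number each property/index combination by order of first appearance
key <- paste(property, n, sep = ".")
id  <- match(key, unique(key))

print(data.frame(property, id))
```

Because the ids follow the order of first appearance, rows that are not sorted by property still get valid groups, only numbered differently.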
