Find the overlap of two datasets - r

I have two different datasets as I've shown below: df_A and df_B.
df_A <- tribble(
~book_name, ~sales_id,
"A", 1,
"B", 2,
"C", 3,
"D", 4,
"E", 5,
"F", 3,
"G", 8,
"H", 6,
"I", 7,
"J", 7,
)
df_B <- tribble(
~book_name, ~sales_id,
"A", 1,
"N", 2,
"C", 3,
"E", 4,
"K", 5,
"R", 3,
"S", 8,
"U", 6,
"Z", 7,
"Y", 7,
)
Now, I want to see the overlap of these two datasets on book_name. Namely, I want to make a list that shows us the book_name that are both in the datasets and also how similar these two datasets according to the book_name column.
Is there any idea to do this in an accurate way?

You can do an inner join between the two dataframes which automatically gives you the intersection between the two dataframes.
This should do the trick,
library(dplyr)
# Creating first data frame
df_A <- tribble(
~book_name, ~sales_id,
"A", 1,
"B", 2,
"C", 3,
"D", 4,
"E", 5,
"F", 3,
"G", 8,
"H", 6,
"I", 7,
"J", 7,
)
# Creating second data frame
df_B <- tribble(
~book_name, ~sales_id,
"A", 1,
"N", 2,
"C", 3,
"E", 4,
"K", 5,
"R", 3,
"S", 8,
"U", 6,
"Z", 7,
"Y", 7,
)
# Joining between the two dataframes to get the common values between the two
result <-
df_A %>%
inner_join(df_B, by = "book_name")

Here is a base R solution, where maybe you can use intersect(), i.e.,
overlap <- subset(df_A,book_name %in% intersect(book_name,df_B$book_name))
such that
> overlap
# A tibble: 3 x 2
book_name sales_id
<chr> <dbl>
1 A 1
2 C 3
3 E 5

Related

Separate entries in dataframe in new rows in R [duplicate]

This question already has answers here:
Split comma-separated strings in a column into separate rows
(6 answers)
Closed 28 days ago.
I have data.frame df below.
df <- data.frame(id = c(1:12),
A = c("alpha", "alpha", "beta", "beta", "gamma", "gamma", "gamma", "delta",
"epsilon", "epsilon", "zeta", "eta"),
B = c("a", "a; b", "a", "c; d; e", "e", "e", "c; f", "g", "a", "g; h", "f", "d"),
C = c(NA, 4, 2, 7, 4, NA, 9, 1, 1, NA, 3, NA),
D = c("ii", "ii", "i", "iii", "iv", "v", "viii", "v", "viii", "i", "iii", "i"))
Column 'B' contains four entries with semicolons. How can I copy each of these rows and enter in column 'B' each of the separate values?
The expected result df2 is:
df2 <- data.frame(id = c(1, 2, 2, 3, 4, 4, 4, 5, 6, 7, 7, 8, 9, 10, 10, 11, 12),
A = c(rep("alpha", 3), rep("beta", 4), rep("gamma", 4), "delta", rep("epsilon", 3),
"zeta", "eta"),
B = c("a", "a", "b", "a", "c", "d", "e", "e", "e", "c", "f", "g", "a", "g", "h", "f", "d"),
C = c(NA, 4, 4, 2, 7, 7, 7, 4, NA, 9, 9, 1, 1, NA, NA, 3, NA),
D = c("ii", "ii", "ii", "i", "iii", "iii", "iii", "iv", "v", "viii", "viii", "v", "viii", "i", "i", "iii", "i"))
I tried this, but no luck:
df2 <- df
# split the values in column B
df2$B <- unlist(strsplit(as.character(df2$B), "; "))
# repeat the rows for each value in column B
df2 <- df2[rep(seq_len(nrow(df2)), sapply(strsplit(as.character(df1$B), "; "), length)),]
# match the number of rows in column B with the number of rows in df2
df2$id <- rep(df2$id, sapply(strsplit(as.character(df1$B), "; "), length))
# sort the dataframe by id
df2 <- df2[order(df2$id),]
We may use separate_rows here - specify the sep as ; followed by zero or more spaces (\\s*) to expand the rows
library(tidyr)
df_new <- separate_rows(df, B, sep = ";\\s*")
-checking with OP's expected
> all.equal(df_new, df2, check.attributes = FALSE)
[1] TRUE
In the base R, we may replicate the sequence of rows by the lengths of the list output
lst1 <- strsplit(df$B, ";\\s+")
df_new2 <- transform(df[rep(seq_len(nrow(df)), lengths(lst1)),], B = unlist(lst1))
row.names(df_new2) <- NULL

Tidyverse: group_by, arrange, and lag across columns

I am working on a projection model for sports where I need to understand in a certain team's most recent game:
Who is their next opponent? (solved)
When is the last time their next opponent played?
reprex that can be used below. Using row 1 as an example, I would need to understand that "a"'s next opponent "e"'s most recent game was game_id_ 3.
game_id_ <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6)
game_date_ <- c(rep("2021-01-29", 6), rep("2021-01-30", 6))
team_ <- c("a", "b", "c", "d", "e", "f", "b", "c", "d", "f", "e", "a")
opp_ <- c("b", "a", "d", "c", "f", "e", "c", "b", "f", "d", "a", "e")
df <- data.frame(game_id_, game_date_, team_, opp_)
#Next opponent
df <- df %>%
arrange(game_date_, game_id_, team_) %>%
group_by(team_) %>%
mutate(next_opp = lead(opp_, n = 1L))
If I can provide more details, please let me know.
We can use match to return the corresponding game_id_
library(dplyr)
df %>%
arrange(game_date_, game_id_, team_) %>%
group_by(team_) %>%
mutate(next_opp = lead(opp_, n = 1L)) %>%
ungroup %>%
mutate(last_time = game_id_[match(next_opp, opp_)])

numbering characters in a string

I want to number the letters in a large dataset. Some letters occur multiple times and are numbered ("A1", "A2"), others also occur multiple times but are not numbered. There are also letters that occur only once... but maybe it's easier to look at the example data below.
The numbers in df$nr are the desired result. How can I get df$nr from df$word and df$letter ?
df <-tibble(word=c(rep("Amamam", 17), rep("Bobob", 14)),
letter=c("A1", "A1", "A1", "A1", "A2", "A2", "m", "m", "m", "a", "a", "m", "m", "a", "a", "m", "m",
"B1", "B1", "B2", "B2", "B3", "B3", "o", "b", "b", "b", "o", "o", "o", "b"),
nr=c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6,
1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 4, 4, 4, 5) )
We can group by 'word', remove the numeric part from the 'letter' column, convert to run-length-id (rleid from data.table)
library(dplyr)
library(stringr)
library(data.table)
df1 <- df %>%
group_by(word) %>%
mutate(nr1 = rleid(str_remove(letter, "\\d+")))
all.equal(df1$nr, df1$nr1)
#[1] TRUE

R graph reorder a factor by levels for only a specified level

I am trying to create a graph where the x axis (a factor) is reordered by descending order of the y axis (numerical values), but only for one of two levels of another factor.
Originally, I tried using the code below:
reorder(factor1, desc(value1))
However, this code only reorganizes the graph (in a descending order) by the sum of the two values under each factor2 (I presume); while I am only interested in reorganizing the data for one level (i.e. "A") under factor2.
Here is some sample data to illustrate better.
sampledata <- data.frame(factor1 = c("A", "A", "B", "B", "C", "C", "D", "D", "E", "E",
"F", "F", "G", "G", "H", "H", "I", "I", "J", "J"),
factor2 = c("A", "H", "A", "H", "A", "H", "A", "H", "A", "H",
"A", "H", "A", "H", "A", "H", "A", "H", "A", "H"),
value1 = c(1, 5, 6, 2, 6, 8, 10, 21, 30, 5,
3, 5, 4, 50, 4, 7, 15, 48, 20, 21))
Here is what I used previously:
sampledata %>%
ggplot(aes(x=reorder(factor1, desc(value1)), y=value1, group=factor2, color=factor2)) +
geom_point()
The reason why I would like to reorder by a specific level (say factor2=="A") is that I can view any deviance of the values for factor2=="H" away from "A" points.
I would appreciate using tidyverse or dplyr as means to solve this problem.
library(ggplto2)
library(dplyr)
sampledata %>%
mutate(value2 = +(factor2=="A")*value1) %>%
ggplot(aes(x=reorder(factor1, desc(value2 + value1/max(value1))), y=value1,
group=factor2, color=factor2)) +
geom_point() +
xlab("factor1")

Most elegant way to convert lists into igraph object for plotting

I am new to igraph and it seems to be a very powerful (and therefore also complex) package.
I tried to convert the following lists into an igraph object.
graph <- list(s = c("a", "b"),
a = c("s", "b", "c", "d"),
b = c("s", "a", "c", "d"),
c = c("a", "b", "d", "e", "f"),
d = c("a", "b", "c", "e", "f"),
e = c("c", "d", "f", "z"),
f = c("c", "d", "e", "z"),
z = c("e", "f"))
weights <- list(s = c(3, 5),
a = c(3, 1, 10, 11),
b = c(5, 3, 2, 3),
c = c(10, 2, 3, 7, 12),
d = c(15, 7, 2, 11, 2),
e = c(7, 11, 3, 2),
f = c(12, 2, 3, 2),
z = c(2, 2))
Interpretation is as follows: s is the starting node, it links to nodes a and b. The edges are weighted 3 for s to a and 5 for s to b and so on.
I tried all kinds of functions from igraph but only got all kinds of errors. What is the most elegant and easy way to convert the above into an igraph object for plotting the graph?
Create an edgelist and then a graph from that. Assign the weights and plot it.
set.seed(123)
e <- as.matrix(stack(graph))
g <- graph_from_edgelist(e)
E(g)$weight <- stack(weights)[[1]]
plot(g, edge.label = E(g)$weight)

Resources