dplyr filter with exception - r

So I'm trying to filter out certain things in my dataset.
Here's a really pared-down example of my dataset:
fish <- data.frame ("order"=c("a", "a", "a", "b", "b", "c", "c", "d", "d", "e", "e"),
"family"= c("r", "s", "t", "r", "y", "y", "y", "u", "y", "u", "y"),
"species"=c(7, 8, 9, 6, 5, 4, 3, 10, 1, 11, 2))
so I have
fish <- fish%>%
filter(
!(order %in% c("a", "b", "c"))&
!(family %in% c("r","s","t","u"))
)
which should remove all orders in a, b, c and all families in r, s, t, u, leaving me with
order family species
d y 10
e y 11
But the issue is, there are two species that are in families that I am filtering out. So say species 1 is in family "r". I want species 1 to stay in the dataset, while filtering all the rest of family r. So I want the output to look like:
order family species
d y 10
e y 11
d r 1
e r 2
How can I make sure that when I'm filtering out the groups of family, it keeps these two species?
Thanks!

You could rbind the results of three separate filters:
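# use %in% rather than != here: != against a vector recycles element-wise and does not test set membership
temp1<-filter(fish,!(order %in% c("a","b","c"))&!(family %in% c("r","s","t","u")))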
temp2<-filter(fish,family=="r"&species==1)
temp3<-filter(fish,family=="s"&species==2)
fish<-rbind(temp1,temp2,temp3)
rm(temp1,temp2,temp3)

It would be most natural to have the filtering process mirror your logic:
Filter #1: filter-out undesirable order and family
Filter #2: filter desirable family, species pairs
Note: I had to change your family, species pair criteria to get matches.
library(dplyr)
library(purrr)
# your example data
fish <- tibble ("order"=c("a", "a", "a", "b", "b", "c", "c", "d", "d", "e", "e"),
"family"= c("r", "s", "t", "r", "y", "y", "y", "u", "y", "u", "y"),
"species"=c(7, 8, 9, 6, 5, 4, 3, 10, 1, 11, 2))
# put filter criteria in variables
order_filter <- c('a', 'b', 'c')
family_filter <- c('r', 's', 't', 'u')
# Filter 1
df1 <- fish %>%
filter(!order %in% order_filter,
!family %in% family_filter)
# Filter 2
df2 <- map_df(.x = list(c('r', 7), c('s', 8)),
.f = function(x) {fish %>%
filter(family == x[1], species == x[2])})
# Combine two data frames created by Filter 1 and Filter 2
df_final <- bind_rows(df1, df2)
print(df_final)
# A tibble: 4 x 3
# order family species
# <chr> <chr> <dbl>
# 1 d y 1
# 2 e y 2
# 3 a r 7
# 4 a s 8
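The two steps can also be collapsed into a single filter() call by OR-ing the exception onto the exclusion rule. This is only a sketch: it assumes the species values alone identify the rows to rescue, keep_species is a hypothetical name, and 7 and 8 are used as the exceptions so that the example data produces matches.

```r
library(dplyr)

# Example data from the question
fish <- tibble(
  order   = c("a", "a", "a", "b", "b", "c", "c", "d", "d", "e", "e"),
  family  = c("r", "s", "t", "r", "y", "y", "y", "u", "y", "u", "y"),
  species = c(7, 8, 9, 6, 5, 4, 3, 10, 1, 11, 2)
)

order_filter  <- c("a", "b", "c")
family_filter <- c("r", "s", "t", "u")
keep_species  <- c(7, 8)  # hypothetical exceptions, chosen to match the data

# Keep a row if it passes both exclusions OR it is a listed exception
result <- fish %>%
  filter(
    (!order %in% order_filter & !family %in% family_filter) |
      species %in% keep_species
  )

print(result)
```

Unlike the bind_rows() version, the rows come back in their original order, and a row that satisfies both conditions cannot be duplicated.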

Related

Counting the occurrence of a word but only once per row (R)

I want to count the number of times a word appears but only once per row. How do I complete the code?
library(stringr)
var1 <- c("x", "x", "x", "x", "x", "x", "y", "y", "y", "y")
var2 <- c("x", "x", "b", "b", "c", "d", "e", "y", "g", "h")
var3 <- c("x", "x", "b", "b", "c", "d", "e", "y", "g", "h")
data <- data.frame(cbind(var1, var2, var3))
sum(str_count(data, "x"))
The result should be 6.
The following should do the trick:
sum(rowSums(data == "x") >= 1) # Thanks Maël
# [1] 6
This checks whether each row contains at least one "x" (rowSums() counts the matches per row) and then uses sum() to count the rows with one or more matches.
Or alternatively (per Antreas's comment so it is not missed):
length(which(rowSums(data == "x") != 0)) # Thanks to Antreas Stefopoulos
Which counts the number of non-zero rows with length()
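To see why both versions arrive at 6, it can help to look at the intermediate per-row counts. The following is a small sketch in base R; the names hits_per_row and n_rows are made up for illustration, and the data frame is built directly rather than through cbind().

```r
var1 <- c("x", "x", "x", "x", "x", "x", "y", "y", "y", "y")
var2 <- c("x", "x", "b", "b", "c", "d", "e", "y", "g", "h")
var3 <- c("x", "x", "b", "b", "c", "d", "e", "y", "g", "h")
data <- data.frame(var1, var2, var3, stringsAsFactors = FALSE)

# data == "x" is a logical matrix; rowSums() counts the TRUEs in each row
hits_per_row <- rowSums(data == "x")
print(unname(hits_per_row))
# [1] 3 3 1 1 1 1 0 0 0 0

# A row counts once no matter how many times "x" appears in it
n_rows <- sum(hits_per_row >= 1)
print(n_rows)
# [1] 6
```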

Tidyverse: group_by, arrange, and lag across columns

I am working on a projection model for sports where, for a given team's most recent game, I need to know:
Who is their next opponent? (solved)
When is the last time their next opponent played?
A reprex is below. Using row 1 as an example, I need to determine that the most recent game of "a"'s next opponent, "e", was game_id_ 3.
game_id_ <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6)
game_date_ <- c(rep("2021-01-29", 6), rep("2021-01-30", 6))
team_ <- c("a", "b", "c", "d", "e", "f", "b", "c", "d", "f", "e", "a")
opp_ <- c("b", "a", "d", "c", "f", "e", "c", "b", "f", "d", "a", "e")
df <- data.frame(game_id_, game_date_, team_, opp_)
#Next opponent
df <- df %>%
arrange(game_date_, game_id_, team_) %>%
group_by(team_) %>%
mutate(next_opp = lead(opp_, n = 1L))
If I can provide more details, please let me know.
We can use match to return the corresponding game_id_
library(dplyr)
df %>%
arrange(game_date_, game_id_, team_) %>%
group_by(team_) %>%
mutate(next_opp = lead(opp_, n = 1L)) %>%
ungroup %>%
mutate(last_time = game_id_[match(next_opp, opp_)])
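Note that match() returns the position of the first match, so on a longer schedule it would pick the opponent's earliest game rather than the latest one. A possible variant is sketched below, assuming "most recent" means the opponent's last game before the upcoming meeting; sched, next_game and last_game_before are hypothetical names introduced for this example.

```r
library(dplyr)

game_id_ <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6)
game_date_ <- c(rep("2021-01-29", 6), rep("2021-01-30", 6))
team_ <- c("a", "b", "c", "d", "e", "f", "b", "c", "d", "f", "e", "a")
opp_ <- c("b", "a", "d", "c", "f", "e", "c", "b", "f", "d", "a", "e")
df <- data.frame(game_id_, game_date_, team_, opp_)

# Next opponent and the id of that upcoming game, per team
sched <- df %>%
  arrange(game_date_, game_id_, team_) %>%
  group_by(team_) %>%
  mutate(next_opp = lead(opp_), next_game = lead(game_id_)) %>%
  ungroup()

# Hypothetical helper: latest game `opponent` played before `before_game`
last_game_before <- function(opponent, before_game) {
  if (is.na(opponent)) return(NA_real_)
  prior <- sched$game_id_[sched$team_ == opponent & sched$game_id_ < before_game]
  if (length(prior) == 0) NA_real_ else max(prior)
}

sched$last_time <- mapply(last_game_before, sched$next_opp, sched$next_game,
                          USE.NAMES = FALSE)
print(sched)
```

On this small reprex the result agrees with the match() version (row 1 gives game 3); the two would diverge once a team has played more than one earlier game.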

How to find the similarity in R?

I have a data set as I've shown below:
It shows which book is sold by which shop.
df <- tribble(
~shop, ~book_id,
"A", 1,
"B", 1,
"C", 2,
"D", 3,
"E", 3,
"A", 3,
"B", 4,
"C", 5,
"D", 1,
)
In the data set,
shop A sells 1, 3
shop B sells 1, 4
shop C sells 2, 5
shop D sells 3, 1
shop E sells only 3
So now, I want to calculate the Jaccard index here. For instance, let's take shop A and shop B. There are three different books that are sold by A or B (book 1, book 3, book 4). However, only one book is sold by both shops (book 1). So, the Jaccard index here should be 33.3% (1/3).
Here is the sample of the desired data:
df <- tribble(
~shop_1, ~shop_2, ~similarity,
"A", "B", 33.3,
"B", "A", 33.3,
"A", "C", 0,
"C", "A", 0,
"A", "D", 100,
"D", "A", 100,
"A", "E", 50,
"E", "A", 50,
)
Any comments/assistance really appreciated! Thanks in advance.
I don't know about a package but you can write your own function. I guess by similarity you mean something like this:
similarity <- function(x, y) {
k <- length(intersect(x, y))
n <- length(union(x, y))
k / n
}
Then you can use tidyr::crossing to pair every row of the grouped data frame with every other row:
library(dplyr)
library(tidyr)
library(purrr)
dfg <- df %>% group_by(shop) %>% summarise(books = list(book_id))
crossing(dfg %>% set_names(paste0, "_A"), dfg %>% set_names(paste0, "_B")) %>%
filter(shop_A != shop_B) %>%
mutate(similarity = map2_dbl(books_A, books_B, similarity))
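The similarity() function above returns a fraction, whereas the desired table is in percent. As a rough cross-check, here is a base R sketch of the same pairwise computation scaled to percentages; books, jaccard and pairs are names invented for this example, and rounding to one decimal is an assumption based on the sample output.

```r
# Shop/book pairs from the question, in the same row order
shop    <- c("A", "B", "C", "D", "E", "A", "B", "C", "D")
book_id <- c(1, 1, 2, 3, 3, 3, 4, 5, 1)

# One vector of book ids per shop
books <- split(book_id, shop)

# Jaccard index: size of the intersection over size of the union
jaccard <- function(x, y) length(intersect(x, y)) / length(union(x, y))

# All ordered pairs of distinct shops
pairs <- expand.grid(shop_1 = names(books), shop_2 = names(books),
                     stringsAsFactors = FALSE)
pairs <- pairs[pairs$shop_1 != pairs$shop_2, ]

pairs$similarity <- round(
  100 * mapply(function(a, b) jaccard(books[[a]], books[[b]]),
               pairs$shop_1, pairs$shop_2),
  1
)
print(pairs[pairs$shop_1 == "A", ])
```

For shop A this reproduces the values in the question: 33.3 with B, 0 with C, 100 with D and 50 with E.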

Find the overlap of two datasets

I have two different datasets as I've shown below: df_A and df_B.
df_A <- tribble(
~book_name, ~sales_id,
"A", 1,
"B", 2,
"C", 3,
"D", 4,
"E", 5,
"F", 3,
"G", 8,
"H", 6,
"I", 7,
"J", 7,
)
df_B <- tribble(
~book_name, ~sales_id,
"A", 1,
"N", 2,
"C", 3,
"E", 4,
"K", 5,
"R", 3,
"S", 8,
"U", 6,
"Z", 7,
"Y", 7,
)
Now, I want to see the overlap of these two datasets on book_name. Namely, I want to make a list of the book_name values that appear in both datasets, and also measure how similar the two datasets are according to the book_name column.
Is there a way to do this accurately?
You can do an inner join between the two dataframes which automatically gives you the intersection between the two dataframes.
This should do the trick:
library(dplyr)
# Creating first data frame
df_A <- tribble(
~book_name, ~sales_id,
"A", 1,
"B", 2,
"C", 3,
"D", 4,
"E", 5,
"F", 3,
"G", 8,
"H", 6,
"I", 7,
"J", 7,
)
# Creating second data frame
df_B <- tribble(
~book_name, ~sales_id,
"A", 1,
"N", 2,
"C", 3,
"E", 4,
"K", 5,
"R", 3,
"S", 8,
"U", 6,
"Z", 7,
"Y", 7,
)
# Joining between the two dataframes to get the common values between the two
result <-
df_A %>%
inner_join(df_B, by = "book_name")
Here is a base R solution using intersect():
overlap <- subset(df_A,book_name %in% intersect(book_name,df_B$book_name))
such that
> overlap
# A tibble: 3 x 2
book_name sales_id
<chr> <dbl>
1 A 1
2 C 3
3 E 5
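Neither answer puts a number on how similar the two datasets are. One way to do so, sketched here in base R, is the Jaccard index on book_name together with the share of each dataset's titles found in the other. The choice of these two measures is an assumption, since the question does not define "similar", and the variable names are illustrative.

```r
# book_name columns from the question
book_A <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
book_B <- c("A", "N", "C", "E", "K", "R", "S", "U", "Z", "Y")

# Titles present in both datasets
common <- intersect(book_A, book_B)
print(common)
# [1] "A" "C" "E"

# Jaccard index: shared titles over all distinct titles (3 / 17 here)
jaccard <- length(common) / length(union(book_A, book_B))

# Share of each dataset's distinct titles that also occur in the other
share_of_A <- length(common) / length(unique(book_A))
share_of_B <- length(common) / length(unique(book_B))

print(round(c(jaccard = jaccard, share_of_A = share_of_A,
              share_of_B = share_of_B), 3))
```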

How to group data, with restrictions on group size in R

Given a data frame, I can group the rows under a stated property, count them to know the size of the group and assign them uniquely with an id number. But what I really need is to do this process so that the group sizes are restricted under the following three conditions:
If size modulo 3 = 0, then split into smaller groups, all of size 3.
If size modulo 3 = 1, then split into smaller groups of size 3 and two groups of size 2.
If size modulo 3 = 2, then split into smaller groups of size 3 and one group of size 2.
Hence, if the size is 4, create two groups of size 2; if the size is 5, split into two groups, one of size 3 and one of size 2.
I have created the following minimal example.
This is the starting data. Typically, it would not be ordered and could have more columns:
structure(
list(property = c("A", "B", "B", "C", "C", "C", "D", "D", "D", "D", "E", "E", "E", "E", "E", "F", "F", "F", "F", "F", "F", "G", "G", "G", "G", "G", "G", "G")),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -28L),
.Names = "property"
)
The desired output would be:
structure(
list(property = c("A", "B", "B", "C", "C", "C", "D", "D", "D", "D", "E", "E", "E", "E", "E", "F", "F", "F", "F", "F", "F", "G", "G", "G", "G", "G", "G", "G"),
id = c(1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6, 6, 7, 7, 8, 8, 8, 9, 9, 9, 10, 10, 10, 11, 11, 12, 12)),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -28L),
.Names = c("property", "id")
)
The order of the groups is not important.
I first create a function that builds the group index according to your requirements. It assigns the same number to each run of three consecutive elements and then truncates the result to the length of the input. In the special case where the last group would have length one, the second-to-last element is moved into the last group so that condition 2 is satisfied:
create_grp_idx <- function(x) {
n <- length(x)
m <- n %/% 3 + 1
idx <- rep(1:m, each = 3)[1:n]
if (n %% 3 == 1 && n > 1) idx[n-1] <- idx[n]
return (idx)
}
Now I use dplyr to group the data by property and then apply create_grp_idx() to each group, thus creating the index n. I then use interaction() to get a factor from each combination of property and the newly created index n. Since you use numbers in your example, I convert the factor to numeric and finally remove the column with the index n.
library(dplyr)
# `data` is the starting data frame from the question
group_by(data, property) %>%
mutate(n = create_grp_idx(property)) %>%
ungroup %>%
mutate(id = as.numeric(interaction(property, n))) %>%
select(-n)
## Source: local data frame [28 x 2]
##
## property id
## (chr) (dbl)
## 1 A 1
## 2 B 2
## 3 B 2
## 4 C 3
## 5 C 3
## 6 C 3
## 7 D 4
## 8 D 4
## 9 D 11
## 10 D 11
## .. ... ...
This does not give exactly the example output you gave, but since you said that the order of the groups is irrelevant, I assume that this is the result that you want.
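If consecutive ids are preferred, one option is to number each (property, n) combination by its first appearance instead of going through interaction(). The sketch below uses base R and reuses create_grp_idx() from above; key and id are illustrative names, and the ids follow the row order of the input, which here is already sorted by property.

```r
# Group index helper, as defined in the answer
create_grp_idx <- function(x) {
  n <- length(x)
  m <- n %/% 3 + 1
  idx <- rep(1:m, each = 3)[1:n]
  if (n %% 3 == 1 && n > 1) idx[n - 1] <- idx[n]
  return(idx)
}

# Starting data from the question: group sizes 1 to 7 for A to G
property <- rep(c("A", "B", "C", "D", "E", "F", "G"), times = 1:7)

# Within-property sub-group index
n <- ave(seq_along(property), property, FUN = create_grp_idx)

# Number each (property, n) combination in order of first appearance
key <- paste(property, n, sep = ".")
id <- match(key, unique(key))

print(data.frame(property, id))
```

With this input the ids run from 1 to 12 with no gaps, so D is split into groups 4 and 5 rather than 4 and 11.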
