Say I have a list of people who used a cafeteria:
Fruits ID Date
apple 1 100510
apple 2 100710
banana 2 110710
banana 1 120910
kiwi 2 120710
apple 3 100210
kiwi 3 110810
I want to select the people who took both an apple and a banana, so that my new dataset contains only the IDs that meet this inclusion criterion:
ID
1
2
(because only IDs 1 and 2 have both apple and banana in the dataset)
What code should I use in R?
In base R you could do something like
data.frame(ID = names(which(sapply(split(df$Fruits, df$ID), function(x) {
"apple" %in% x & "banana" %in% x
}))))
#> ID
#> 1 1
#> 2 2
This gives you the IDs that contain both "apple" and "banana".
If you want the subset of the data frame containing these rows you can do:
df[df$ID %in% names(which(sapply(split(df$Fruits, df$ID), function(x) {
"apple" %in% x & "banana" %in% x
}))),]
#> Fruits ID Date
#> 1 apple 1 100510
#> 2 apple 2 100710
#> 3 banana 2 110710
#> 4 banana 1 120910
#> 5 kiwi 2 120710
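The same test generalises to any set of required fruits. The following is only a minimal sketch, assuming the data frame from the question; the helper name ids_with_all is made up for illustration and is not part of base R.

```r
df <- data.frame(
  Fruits = c("apple", "apple", "banana", "banana", "kiwi", "apple", "kiwi"),
  ID     = c(1, 2, 2, 1, 2, 3, 3),
  Date   = c(100510, 100710, 110710, 120910, 120710, 100210, 110810)
)

# Hypothetical helper: IDs whose rows cover every value in `required`
ids_with_all <- function(data, required) {
  covered <- vapply(
    split(data$Fruits, data$ID),
    function(x) all(required %in% x),
    logical(1)
  )
  names(covered)[covered]
}

ids <- ids_with_all(df, c("apple", "banana"))
ids

# Rows belonging to the qualifying IDs
df[df$ID %in% ids, ]
```

Passing a longer vector, for example c("apple", "banana", "kiwi"), should tighten the criterion without any other change.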
You can use the dplyr package to check, per ID, whether any entry is an apple and any entry is a banana:
library(dplyr)
df <- data.frame(Fruits = c("apple", "apple", "banana", "banana", "kiwi", "apple", "kiwi"),
ID = c(1,2,2,1,2,3,3),
Date = c(100510, 100710, 110710, 120910, 120710, 100210, 110810))
df %>%
group_by(ID) %>%
filter(any(Fruits == "apple") & any(Fruits == "banana")) %>%
ungroup() %>%
select(ID) %>%
distinct()
In this case, the result is
# A tibble: 2 x 1
ID
<dbl>
1 1
2 2
I have a tidy data.frame in this format:
library(tidyverse)
df = data.frame(name = c("Clarence","Clarence","Clarence","Shelby","Shelby", "Patricia","Patricia"), fruit = c("Apple", "Banana", "Grapes", "Apple", "Apricot", "Banana", "Grapes"))
df
# name fruit
#1 Clarence Apple
#2 Clarence Banana
#3 Clarence Grapes
#4 Shelby Apple
#5 Shelby Apricot
#6 Patricia Banana
#7 Patricia Grapes
I want to compare the overlaps between groups in a pairwise manner (i.e. if both people have an apple that counts as an overlap of 1) so that I end up with a dataframe that looks like this:
df2 = data.frame(names = c("Clarence-Shelby", "Clarence-Patricia", "Shelby-Patricia"), n_overlap = c(1, 2, 0))
df2
# names n_overlap
#1 Clarence-Shelby 1
#2 Clarence-Patricia 2
#3 Shelby-Patricia 0
Is there an elegant way to do this in the tidyverse framework? My real dataset is much larger than this and will be grouped on multiple columns.
If the 0 overlap is not important, a solution is:
> df %>% inner_join(df,by="fruit") %>% filter(name.x<name.y) %>% count(name.x,name.y)
name.x name.y n
1 Clarence Patricia 2
2 Clarence Shelby 1
If you really need non-overlapping pairs:
> a = df %>% inner_join(df,by="fruit") %>% filter(name.x<name.y) %>% count(name.x,name.y)
> b = as.data.frame(t(combn(sort(unique(df$name)),2)))
> colnames(b)=colnames(a)[1:2]
> a %>% full_join(b) %>% replace_na(list(n=0))
Joining, by = c("name.x", "name.y")
name.x name.y n
1 Clarence Patricia 2
2 Clarence Shelby 1
3 Patricia Shelby 0
Try this,
combinations <- apply(combn(unique(df$name), 2), 2, function(z) paste(sort(z), collapse = "-"))
combinations
# [1] "Clarence-Shelby" "Clarence-Patricia" "Patricia-Shelby"
library(dplyr)
df %>%
group_by(fruit) %>%
summarize(names = paste(sort(unique(name)), collapse = "-")) %>%
right_join(tibble(names = combinations), by = "names") %>%
group_by(names) %>%
summarize(n_overlap = sum(!is.na(fruit)))
# # A tibble: 3 x 2
# names n_overlap
# <chr> <int>
# 1 Clarence-Patricia 2
# 2 Clarence-Shelby 1
# 3 Patricia-Shelby 0
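For comparison, the pairwise overlaps can also be read off a matrix product in base R. This is only a sketch under the assumption that the data fit comfortably in a fruit-by-name table; the variable names incidence, shared and pairs are illustrative.

```r
df <- data.frame(
  name  = c("Clarence", "Clarence", "Clarence", "Shelby", "Shelby",
            "Patricia", "Patricia"),
  fruit = c("Apple", "Banana", "Grapes", "Apple", "Apricot",
            "Banana", "Grapes")
)

# fruit x name incidence matrix (1 = that person has that fruit)
incidence <- unclass(table(df$fruit, df$name))
incidence[incidence > 1] <- 1   # ignore repeated name/fruit rows

# name x name matrix: entry [i, j] counts the fruits shared by i and j
shared <- crossprod(incidence)

# Keep each unordered pair once (upper triangle, diagonal excluded)
pairs <- which(upper.tri(shared), arr.ind = TRUE)
result <- data.frame(
  names     = paste(rownames(shared)[pairs[, "row"]],
                    colnames(shared)[pairs[, "col"]], sep = "-"),
  n_overlap = shared[pairs]
)
result
```

Pairs with no shared fruit appear with a count of 0 without any extra join, because every pair has a cell in the matrix.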
I have the following data:
df = data.frame(
id = c("anton", "anton", "charly", "charly", "klaus", "klaus"),
fruits=c("apple", "cherry", "pear", "pear", "apple", "pear"),
number=c(1,4,1,2,3,5))
id fruits number
1 anton apple 1
2 anton cherry 4
3 charly pear 1
4 charly pear 2
5 klaus apple 3
6 klaus pear 5
desired outcome:
id fruits number
1 anton apple, cherry 1, 4
2 charly pear, pear 1, 2
3 klaus apple, pear 3, 5
it works with
library(dplyr)
df.wide <- df %>%
group_by(id) %>%
summarise_all(funs(toString(na.omit(.))))
but I get the warning
"funs() is deprecated as of dplyr 0.8.0. Please use a list of either
functions or lambdas:
Simple named list:
list(mean = mean, median = median)
Auto named with tibble::lst():
tibble::lst(mean, median)
Using lambdas
list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))".
How can I reproduce this result without funs()? With 'list' and then 'unnest'? (I tried that, but I cannot work out how to unnest all columns.)
Note that summarise_all is also superseded. Instead you can use across together with a purrr-style lambda function:
df = data.frame(
id = c("anton", "anton", "charly", "charly", "klaus", "klaus"),
fruits=c("apple", "cherry", "pear", "pear", "apple", "pear"),
number=c(1,4,1,2,3,5))
library(dplyr)
df.wide <- df %>%
group_by(id) %>%
summarise_all(funs(toString(na.omit(.))))
df.wide
#> # A tibble: 3 x 3
#> id fruits number
#> <chr> <chr> <chr>
#> 1 anton apple, cherry 1, 4
#> 2 charly pear, pear 1, 2
#> 3 klaus apple, pear 3, 5
df_new <- df %>%
group_by(id) %>%
summarise(across(everything(), ~toString(na.omit(.))))
df_new
#> # A tibble: 3 x 3
#> id fruits number
#> <chr> <chr> <chr>
#> 1 anton apple, cherry 1, 4
#> 2 charly pear, pear 1, 2
#> 3 klaus apple, pear 3, 5
Created on 2020-09-25 by the reprex package (v0.3.0)
You can either collapse each column explicitly or use across() in combination with everything():
df %>%
group_by(id) %>%
summarise(fruits = paste(fruits, collapse = ", "),
number = paste(number, collapse = ", "))
df %>%
group_by(id) %>%
summarise(across(everything(), ~paste(., collapse = ", ")))
which yields
id fruits number
<chr> <chr> <chr>
1 anton apple, cherry 1, 4
2 charly pear, pear 1, 2
3 klaus apple, pear 3, 5
For examples of how to use these new functions, see: https://www.tidyverse.org/blog/2020/04/dplyr-1-0-0-colwise/
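If you would rather avoid dplyr altogether, base R's aggregate can produce the same collapsed table. This is a minimal sketch, assuming the data frame from the question; res is just an illustrative name.

```r
df <- data.frame(
  id     = c("anton", "anton", "charly", "charly", "klaus", "klaus"),
  fruits = c("apple", "cherry", "pear", "pear", "apple", "pear"),
  number = c(1, 4, 1, 2, 3, 5),
  stringsAsFactors = FALSE
)

# Collapse every non-grouping column to one comma-separated string per id
res <- aggregate(df[c("fruits", "number")], by = df["id"], FUN = toString)
res
```

As in the dplyr versions, the number column comes back as character, because toString turns each group into a single string.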
I have data, a simplified version of which looks like this:
df_current <- data.frame(
start = c('yes', rep('no', 5), 'yes', rep('no', 3)),
season = c('banana', rep('to update', 5), 'apple', rep('to update', 3)),
stringsAsFactors = F
)
Let's say that the "start" variable indicates when a new season starts, and I can use that in combination with a date variable (not included) to indicate where apple and banana season start. Once this is done, I want to update the rest of the rows in the "season" column. All of the rows which currently have the value "to update" should be updated to have the value of the type of fruit whose season has most recently started (the rows are arranged by date). In other words, I want the data to look like this:
df_desired <- data.frame(
start = c('yes', rep('no', 5), 'yes', rep('no', 3)),
season = c(rep('banana', 6), rep('apple', 4)),
stringsAsFactors = F
)
I had assumed that something like the following would work:
updated <- df_current %>%
rowwise() %>%
mutate(season = case_when(
season != 'to update' ~ season,
season == 'to update' ~ lag(season)
))
However, that generates NAs at all the 'to update' values.
An easy way would be to replace "to update" with NA and then use fill.
library(dplyr)
library(tidyr)
df_current %>%
mutate(season = replace(season, season == "to update", NA)) %>%
fill(season)
# start season
#1 yes banana
#2 no banana
#3 no banana
#4 no banana
#5 no banana
#6 no banana
#7 yes apple
#8 no apple
#9 no apple
#10 no apple
Using the same logic you can also use zoo::na.locf to fill missing values with latest non-missing values.
You get NAs because rowwise() makes each case_when evaluation see a single row, so season holds only one value and lag(season) always returns NA. Here is a base R solution that uses rle:
x <- rle(df_current$season)
x
#> Run Length Encoding
#> lengths: int [1:4] 1 5 1 3
#> values : chr [1:4] "banana" "to update" "apple" "to update"
x$values[x$values == "to update"] <- x$values[which(x$values == "to update") - 1]
x
#> Run Length Encoding
#> lengths: int [1:4] 1 5 1 3
#> values : chr [1:4] "banana" "banana" "apple" "apple"
df_current$season <- inverse.rle(x)
df_current
#> start season
#> 1 yes banana
#> 2 no banana
#> 3 no banana
#> 4 no banana
#> 5 no banana
#> 6 no banana
#> 7 yes apple
#> 8 no apple
#> 9 no apple
#> 10 no apple
We can use na_if
library(dplyr)
library(tidyr)
df_current %>%
mutate(season = na_if(season, "to update")) %>%
fill(season)
# start season
#1 yes banana
#2 no banana
#3 no banana
#4 no banana
#5 no banana
#6 no banana
#7 yes apple
#8 no apple
#9 no apple
#10 no apple
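The carry-forward idea behind fill and na.locf can also be written in a few lines of base R. This is a sketch only, assuming the rows are already in date order; known and idx are illustrative names.

```r
df_current <- data.frame(
  start  = c("yes", rep("no", 5), "yes", rep("no", 3)),
  season = c("banana", rep("to update", 5), "apple", rep("to update", 3)),
  stringsAsFactors = FALSE
)

known <- df_current$season != "to update"  # rows that already name a season
idx   <- cumsum(known)                     # number of known rows seen so far

# Prepend NA so rows before the first known value (idx == 0) stay missing
df_current$season <- c(NA, df_current$season[known])[idx + 1]
df_current
```

Each row looks up the most recent known season via its running count, so no loop or extra package is needed.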
I'm not sure what this problem is even called. Let's say I'm counting distinct combinations of 2 columns, but I want distinct across the order of the two columns. Here's what I mean:
df = data.frame(fruit1 = c("apple", "orange", "orange", "banana", "kiwi"),
fruit2 = c("orange", "apple", "banana", "orange", "apple"),
stringsAsFactors = FALSE)
# What I want: total number of fruit combinations, regardless of
# which fruit comes first and which second.
# Eg 2 apple-orange, 2 banana-orange, 1 kiwi-apple
# What I know *doesn't* work:
table(df$fruit1, df$fruit2)
# What *does* work:
library(dplyr)
df %>% group_by(fruit1, fruit2) %>%
transmute(fruitA = sort(c(fruit1, fruit2))[1],
fruitB = sort(c(fruit1, fruit2))[2]) %>%
group_by(fruitA, fruitB) %>%
summarise(combinations = n())
I've got a way to make this work, as you can see, but is there a name for this general problem? It's sort of a combinatorics problem but counting, not generating combinations. And what if I had three or four columns of similar type? The above method is poorly generalizable. Tidyverse approaches most welcome!
Use apply with sort to put each row's values in a consistent order, then group by all columns and count:
data.frame(t(apply(df,1,sort)))%>%group_by_all(.)%>%count()
# A tibble: 3 x 3
# Groups: X1, X2 [3]
X1 X2 n
<fctr> <fctr> <int>
1 apple kiwi 1
2 apple orange 2
3 banana orange 2
Here is an option using pmap with count
library(tidyverse)
library(rlang)
pmap_df(df, ~ sort(c(...)) %>%
as.list %>%
as_tibble %>%
set_names(names(df))) %>%
count(!!! rlang::syms(names(.)))
# A tibble: 3 x 3
# fruit1 fruit2 n
# <chr> <chr> <int>
#1 apple kiwi 1
#2 apple orange 2
#3 banana orange 2
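For exactly two columns, a base R sketch with pmin and pmax avoids row-wise sorting: both functions work element-wise on character vectors, so each pair can be put into a canonical order in one step. The names first, second and counts are illustrative, and the ordering assumes plain lowercase strings as in the question.

```r
df <- data.frame(
  fruit1 = c("apple", "orange", "orange", "banana", "kiwi"),
  fruit2 = c("orange", "apple", "banana", "orange", "apple"),
  stringsAsFactors = FALSE
)

# Put the alphabetically smaller fruit first, so (a, b) and (b, a) share a key
first  <- pmin(df$fruit1, df$fruit2)
second <- pmax(df$fruit1, df$fruit2)

# Count each unordered pair
counts <- as.data.frame(table(pair = paste(first, second, sep = "-")))
counts
```

With three or more columns this shortcut no longer applies, and sorting each row as in the answers above is the more general route.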
I'm starting with the following data:
df <- data.frame(Person=c("Ada","Ada","Bob","Bob","Carl","Carl"), Day=c(1,2,2,1,1,2), Fruit=c("Apple","X","Apple","X","X","Orange"))
Person Day Fruit
1 Ada 1 Apple
2 Ada 2 X
3 Bob 2 Apple
4 Bob 1 X
5 Carl 1 X
6 Carl 2 Orange
And I want to loop through every person and replace the unknown fruit X with either Apple or Orange while making sure that if it's Orange one day, it should be Apple the next day, and vice versa.
For Ada: Day 1 = Apple, meaning Day 2 = X <- Orange
I don't know where to start other than:
library(dplyr)
df %>%
group_by(Person)
Any suggestions for a direction?
Another solution using case_when from dplyr:
library(dplyr)
# Changing datatypes to character instead of factor
df[] <- lapply(df, as.character)
# Optional, but this line will convert all columns to appropriate datatype, eg. Day will be integer
df <- readr::type_convert(df)
df %>%
group_by(Person) %>%
mutate(
Contains_Apple = any(Fruit == "Apple"),
Contains_Orange = any(Fruit == "Orange"),
Fruit = case_when(
Fruit == "X" & Contains_Apple == F ~ "Apple",
Fruit == "X" & Contains_Orange == F ~ "Orange",
TRUE ~ Fruit
)
)
# A tibble: 6 x 5
# Groups: Person [3]
Person Day Fruit Contains_Apple Contains_Orange
<chr> <int> <chr> <lgl> <lgl>
1 Ada 1 Apple T F
2 Ada 2 Orange T F
3 Bob 2 Apple T F
4 Bob 1 Orange T F
5 Carl 1 Apple F T
6 Carl 2 Orange F T
Remove the Contains_Apple and Contains_Orange columns with:
df %>%
group_by(Person) %>%
mutate(Contains_Apple = any(Fruit == "Apple"),
Contains_Orange = any(Fruit == "Orange"),
Fruit = case_when(Fruit == "X" & Contains_Apple == F ~ "Apple",
Fruit == "X" & Contains_Orange == F ~ "Orange",
TRUE ~ Fruit)) %>%
select(Person, Day, Fruit) %>%
ungroup()
# A tibble: 6 x 3
Person Day Fruit
<chr> <int> <chr>
1 Ada 1 Apple
2 Ada 2 Orange
3 Bob 2 Apple
4 Bob 1 Orange
5 Carl 1 Apple
6 Carl 2 Orange
Here is one idea using case_when to check if each group already has "Apple" or "Orange", and then assign the opposite value if Fruit is "X".
Notice that I added stringsAsFactors = FALSE when creating the example data frame, which aims to avoid the creation of factor columns.
library(dplyr)
library(tidyr)
df %>%
group_by(Person) %>%
mutate(Fruit = case_when(
Fruit %in% "X" & any(Fruit %in% "Apple") ~ "Orange",
Fruit %in% "X" & any(Fruit %in% "Orange") ~ "Apple",
TRUE ~ Fruit
)) %>%
ungroup()
# # A tibble: 6 x 3
# Person Day Fruit
# <chr> <dbl> <chr>
# 1 Ada 1.00 Apple
# 2 Ada 2.00 Orange
# 3 Bob 2.00 Apple
# 4 Bob 1.00 Orange
# 5 Carl 1.00 Apple
# 6 Carl 2.00 Orange
DATA
df <- data.frame(Person=c("Ada","Ada","Bob","Bob","Carl","Carl"),
Day=c(1,2,2,1,1,2),
Fruit=c("Apple","X","Apple","X","X","Orange"),
stringsAsFactors = FALSE)
Simple with looping:
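fruity_loop <- function(frame) {
  ops <- c('Apple', 'Orange')
  for (x in seq_len(nrow(frame))) {
    if (frame$Fruit[x] == 'X') {
      # Known fruits recorded for the same person (not just the previous row)
      known <- frame$Fruit[frame$Person == frame$Person[x] & frame$Fruit != 'X']
      # Assign whichever option that person has not had yet
      frame$Fruit[x] <- if (ops[1] %in% known) ops[2] else ops[1]
    }
  }
  return(frame)
}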
Example:
fruity_loop(df)
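If a person can have more than two days, the alternation rule can be applied through the parity of the day gap. This is a hedged sketch, not part of the answers above: it assumes each person has at least one known fruit, that fruits strictly alternate from day to day, and the helper name fill_alternating is made up for illustration.

```r
df <- data.frame(
  Person = c("Ada", "Ada", "Bob", "Bob", "Carl", "Carl"),
  Day    = c(1, 2, 2, 1, 1, 2),
  Fruit  = c("Apple", "X", "Apple", "X", "X", "Orange"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: fill "X" for one person from a single known day
fill_alternating <- function(day, fruit, ops = c("Apple", "Orange")) {
  known <- which(fruit %in% ops)[1]      # first row with a known fruit
  if (is.na(known)) return(fruit)        # nothing to anchor on, leave as is
  same  <- (day - day[known]) %% 2 == 0  # even day gap means the same fruit
  other <- setdiff(ops, fruit[known])
  ifelse(fruit == "X", ifelse(same, fruit[known], other), fruit)
}

# Apply the helper person by person
for (p in unique(df$Person)) {
  rows <- df$Person == p
  df$Fruit[rows] <- fill_alternating(df$Day[rows], df$Fruit[rows])
}
df
```

Because the rule depends on Day rather than on row position, the rows do not need to be sorted first.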