Filter by combination of (row) pairs - r

I have a dataframe in a long format and I want to filter pairs based on unique combinations of values. I have a dataset that looks like this:
id <- rep(1:4, each=2)
type <- c("blue", "blue", "red", "yellow", "blue", "red", "red", "yellow")
df <- data.frame(id,type)
df
id type
1 1 blue
2 1 blue
3 2 red
4 2 yellow
5 3 blue
6 3 red
7 4 red
8 4 yellow
Let's say each id is a respondent and type is a combination of treatments. Individual 1 saw two objects, both of them blue; individual 2 saw one red object and a yellow one; and so on.
How do I keep, for example, those that saw the combination "red" and "yellow"? If I filter by the combination "red" and "yellow" the resulting dataset should look like this:
id type
3 2 red
4 2 yellow
7 4 red
8 4 yellow
It should keep respondents number 2 and number 4 (only those that saw the combination "red" and "yellow"). Note that it does not keep respondent number 3 because she saw "blue" and "red" (instead of "red" and "yellow"). How do I do this?
One solution is to reshape the dataset into a wide format, filter it by column, and restack again. But I am sure there is another way to do it without reshaping the dataset. Any idea?

A dplyr solution would be:
library(dplyr)
df <- tibble(  # data_frame() is deprecated in favour of tibble()
id = rep(1:4, each = 2),
type = c("blue", "blue", "red", "yellow", "blue", "red", "red", "yellow")
)
types <- c("red", "yellow")
df %>%
group_by(id) %>%
filter(all(types %in% type))
#> # A tibble: 4 x 2
#> # Groups: id [2]
#> id type
#> <int> <chr>
#> 1 2 red
#> 2 2 yellow
#> 3 4 red
#> 4 4 yellow
Update
To allow combinations with repeated values, e.g. blue, blue, change the filter() call to the following:
types2 <- c("blue", "blue")
df %>%
group_by(id) %>%
filter(sum(types2 == type) == length(types2))
#> # A tibble: 2 x 2
#> # Groups: id [1]
#> id type
#> <int> <chr>
#> 1 1 blue
#> 2 1 blue
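The same call also works for different types, provided each group has as many rows as types has elements and lists them in the same order, because == compares element by element: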
df %>%
group_by(id) %>%
filter(sum(types == type) == length(types))
#> # A tibble: 4 x 2
#> # Groups: id [2]
#> id type
#> <int> <chr>
#> 1 2 red
#> 2 2 yellow
#> 3 4 red
#> 4 4 yellow
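Because sum(types == type) compares position by position, a group that saw "yellow" then "red" would not match c("red", "yellow"). The following is a minimal sketch of an order-independent variant, not part of the original answer; same_multiset is a hypothetical helper name introduced here for illustration. It sorts both sides before comparing, so duplicates still count but order does not.

```r
library(dplyr)

df <- tibble(
  id = rep(1:4, each = 2),
  type = c("blue", "blue", "red", "yellow", "blue", "red", "red", "yellow")
)

# Hypothetical helper (assumption, not from the original answer):
# TRUE when `x` holds exactly the values in `target`,
# counting duplicates but ignoring order.
same_multiset <- function(x, target) {
  length(x) == length(target) && all(sort(x) == sort(target))
}

# The target is deliberately given in the "wrong" order
result <- df %>%
  group_by(id) %>%
  filter(same_multiset(type, c("yellow", "red"))) %>%
  ungroup()

# Repeated values are handled the same way
blue_pairs <- df %>%
  group_by(id) %>%
  filter(same_multiset(type, c("blue", "blue"))) %>%
  ungroup()
```

Here result should contain ids 2 and 4, and blue_pairs only id 1. Unlike all(types %in% type), this check rejects groups that contain extra values beyond the target combination.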

Use all() with %in% to check, within each group, that every value in the target set appears at least once.
library(tidyverse)
test_filter <- c("red", "yellow")
df %>%
group_by(id) %>%
filter(all(test_filter %in% type))
# A tibble: 4 x 2
# Groups: id [2]
id type
<int> <fctr>
1 2 red
2 2 yellow
3 4 red
4 4 yellow
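To see what the grouped condition evaluates to, it can help to compute it once per id outside of dplyr. This is a small base R sketch under the assumption that df is the two-column data frame from the question; has_all and kept are names introduced here for illustration.

```r
df <- data.frame(
  id = rep(1:4, each = 2),
  type = c("blue", "blue", "red", "yellow", "blue", "red", "red", "yellow"),
  stringsAsFactors = FALSE
)
test_filter <- c("red", "yellow")

# One logical per id: does this respondent's set of types
# contain every value in test_filter?
has_all <- sapply(split(df$type, df$id), function(x) all(test_filter %in% x))

# Base R equivalent of group_by(id) %>% filter(all(test_filter %in% type)):
# keep every row whose id passed the check
kept <- df[df$id %in% names(has_all)[has_all], ]
```

has_all is TRUE for ids 2 and 4 only, which is why exactly those groups survive the filter.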

I modified your data and did the following.
df <- data.frame(id = rep(1:4, each=3),
type = c("blue", "blue", "green", "red", "yellow", "purple",
"blue", "orange", "yellow", "yellow", "pink", "red"),
stringsAsFactors = FALSE)
id type
1 1 blue
2 1 blue
3 1 green
4 2 red
5 2 yellow
6 2 purple
7 3 blue
8 3 orange
9 3 yellow
10 4 yellow
11 4 pink
12 4 red
As you can see, there are three observations for each id. Ids 2 and 4 have both red and yellow, and they also have non-target colors (purple and pink). I wanted to preserve these observations as well, so I wrote the following code. It reads like this: "For each id, check whether there is any red and any yellow using any(). When both conditions are TRUE, keep all rows for that id."
group_by(df, id) %>%
filter(any(type == "yellow") & any(type == "red"))
id type
4 2 red
5 2 yellow
6 2 purple
10 4 yellow
11 4 pink
12 4 red

Using data.table (this assumes "red" comes directly before "yellow" within each id):
library(data.table)
setDT(df)
df[, type1 := shift(type, type = "lag"), by = id]
df1 <- df[type == "yellow" & type1 == "red", id]
df <- df[id %in% df1, ]
df[, type1 := NULL]
It gives:
id type
1: 2 red
2: 2 yellow
3: 4 red
4: 4 yellow
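The shift() approach depends on row order: it only finds ids where "red" is immediately followed by "yellow". As a hedged alternative, here is a minimal data.table sketch that ignores order. The example data (dt) and the name targets are made up for illustration; id 2 lists the colors in reverse on purpose.

```r
library(data.table)

# Illustrative data (assumption): id 2 has yellow before red
dt <- data.table(
  id = rep(1:3, each = 2),
  type = c("red", "yellow", "yellow", "red", "blue", "red")
)
targets <- c("red", "yellow")

# Return the whole group (.SD) only when every target value occurs in it;
# groups that fail the check return NULL and are dropped.
result <- dt[, if (all(targets %in% type)) .SD, by = id]
```

With this data, result keeps ids 1 and 2, whereas the lag-based version would keep only id 1.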

Related

Using R/dplyr to filter columns?

I have a simple Q... I have a dataset I need to filter by certain parameters. I was hoping for a solution in R?
Dummy case:
colour age animal
red 10 dog
yellow 5 cat
pink 6 cat
I want to classify this dataset e.g. by:
If colour is 'red' OR 'pink' AND age is <7 AND animal is 'cat' then = category 1.
Else category 2.
Output would be:
colour age animal category
red 10 dog 2
yellow 5 cat 2
pink 6 cat 1
Is there a way to manipulate dplyr to achieve this? I'm a clinician not a bioinformatician so go easy!
I like the case_when function in dplyr to set up more complex selections with mutate.
library(tidyverse)
df <- data.frame(colour = c("red", "yellow", "pink", "red", "pink"),
age = c(10, 5, 6, 12, 10),
animal = c("dog", "cat", "cat", "hamster", "cat"))
df
#> colour age animal
#> 1 red 10 dog
#> 2 yellow 5 cat
#> 3 pink 6 cat
#> 4 red 12 hamster
#> 5 pink 10 cat
df <- mutate(df, category = case_when(
((colour == "red" | colour == "pink") & age < 7 & animal == "cat") ~ 1,
(colour == "yellow" | age != 5 & animal == "dog") ~ 2,
(colour == "pink" | animal == "cat") ~ 3,
(TRUE) ~ 4) )
df
#> colour age animal category
#> 1 red 10 dog 2
#> 2 yellow 5 cat 2
#> 3 pink 6 cat 1
#> 4 red 12 hamster 4
#> 5 pink 10 cat 3
Created on 2021-01-17 by the reprex package (v0.3.0)
You could also compute this as:
df$category <- with(df,!(colour %in% c('red', 'pink') & age < 7 & animal == 'cat')) + 1
df
# colour age animal category
#1 red 10 dog 2
#2 yellow 5 cat 2
#3 pink 6 cat 1
And in dplyr :
df %>%
mutate(category = as.integer(!(colour %in% c('red', 'pink') &
age < 7 & animal == 'cat')) + 1)
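For the two-category rule exactly as stated in the question, a plain ifelse() may be the easiest version to read. This is a small base R sketch; is_cat1 is a name introduced here. Using %in% for the colour test avoids having to parenthesise the OR, which matters because & binds more tightly than | in R.

```r
df <- data.frame(
  colour = c("red", "yellow", "pink"),
  age = c(10, 5, 6),
  animal = c("dog", "cat", "cat"),
  stringsAsFactors = FALSE
)

# TRUE where (colour is red OR pink) AND age < 7 AND animal is cat
is_cat1 <- df$colour %in% c("red", "pink") & df$age < 7 & df$animal == "cat"

# Category 1 when the rule holds, category 2 otherwise
df$category <- ifelse(is_cat1, 1, 2)
```

Only the pink, 6-year-old cat satisfies all three conditions, so the categories come out as 2, 2, 1.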

R variable number of string concatenations within group_by

Let's say I have the following table of houses (or anything) and their colors:
I'm trying to:
1. group_by(Group),
2. count rows (I assume with length(unique(ID))),
3. mutate or summarize into a new row with a count of each color in group, as a string.
Result should be:
So I know step 3 could be done by manually entering every possible combination with something like
df <- df %>%
group_by(Group) %>%
mutate(
Summary = case_when(
all(
sum(count_green) > 0
) ~ paste(length(unique(ID)), " houses, ", count_green, " green")
)
)
but what if I have hundreds of possible combinations? Is there a way to paste into a string and append for each new color/count?
Here is one approach: count the frequency of each 'Group', 'Color' pair with add_count, unite that count with 'Color' into 'nColor', then, grouped by 'Group', create the 'Summary' column by concatenating the group size (n()) with the unique elements of 'nColor'.
library(dplyr)
library(tidyr)
library(stringr)
df %>%
add_count(Group, Color) %>%
unite(nColor, n, Color, sep= ' ', remove = FALSE) %>%
group_by(Group) %>%
mutate(
Summary = str_c(n(), ' houses, ', toString(unique(nColor)))) %>%
select(-nColor)
# Groups: Group [2]
# ID Group Color n Summary
# <int> <chr> <chr> <int> <chr>
#1 1 a Green 2 3 houses, 2 Green, 1 Orange
#2 2 a Green 2 3 houses, 2 Green, 1 Orange
#3 3 a Orange 1 3 houses, 2 Green, 1 Orange
#4 4 b Blue 2 3 houses, 2 Blue, 1 Yellow
#5 5 b Yellow 1 3 houses, 2 Blue, 1 Yellow
#6 6 b Blue 2 3 houses, 2 Blue, 1 Yellow
data
df <- structure(list(ID = 1:6, Group = c("a", "a", "a", "b", "b", "b"
), Color = c("Green", "Green", "Orange", "Blue", "Yellow", "Blue"
)), class = "data.frame", row.names = c(NA, -6L))
Here's an approach with map_chr from purrr and a lot of pasting.
library(dplyr)
library(purrr)
df %>%
group_by(Group) %>%
mutate(Summary = paste(n(),"houses,",
paste(map_chr(unique(as.character(Color)),
~paste(sum(Color == .x),.x)),
collapse = ", ")))
## A tibble: 6 x 4
## Groups: Group [2]
# ID Group Color Summary
# <int> <fct> <fct> <chr>
#1 1 a Green 3 houses, 2 Green, 1 Orange
#2 2 a Green 3 houses, 2 Green, 1 Orange
#3 3 a Orange 3 houses, 2 Green, 1 Orange
#4 4 b Blue 3 houses, 2 Blue, 1 Yellow
#5 5 b Yellow 3 houses, 2 Blue, 1 Yellow
#6 6 b Blue 3 houses, 2 Blue, 1 Yellow
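The same summary string can also be built without extra packages. Below is a base R sketch, assuming Color is a character column as in the data given above; summarise_group is a hypothetical helper name. It counts colors in order of first appearance and lets ave() spread the single resulting string over every row of the group.

```r
df <- structure(list(ID = 1:6, Group = c("a", "a", "a", "b", "b", "b"),
                     Color = c("Green", "Green", "Orange", "Blue", "Yellow", "Blue")),
                class = "data.frame", row.names = c(NA, -6L))

# Hypothetical helper: build "<n> houses, <count> <color>, ..." for one group
summarise_group <- function(colors) {
  # Fix the level order to first appearance so the string follows the data
  counts <- table(factor(colors, levels = unique(colors)))
  paste0(length(colors), " houses, ",
         paste(as.vector(counts), names(counts), collapse = ", "))
}

# ave() applies the helper per Group and recycles the result to each row
df$Summary <- ave(df$Color, df$Group, FUN = summarise_group)
```

Because the helper loops over whatever colors are present, it does not need to know the possible combinations in advance.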

Replacing NA found with filter() using lapply() when all df do not have NA [duplicate]

I've been having trouble trying to get lapply to replace NAs in a dataframe that I find using a call to filter().
tib <- as_tibble(data.frame("Group"= c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"), "Color" = c("Red", "Red", "Red", "Blue", "Blue", "Blue", "Red", "Red", "Red", "Blue", "Blue", "Blue"), "Value" = c(5,NA,6,NA,16,12,4,5,6,10,12,17)))
> list.tib <- split(tib, tib$Group)
> list.tib
$`A`
# A tibble: 6 x 3
Group Color Value
<fct> <fct> <dbl>
1 A Red 5
2 A Red NA
3 A Red 6
4 A Blue NA
5 A Blue 16
6 A Blue 12
$B
# A tibble: 6 x 3
Group Color Value
<fct> <fct> <dbl>
1 B Red 4
2 B Red 5
3 B Red 6
4 B Blue 10
5 B Blue 12
6 B Blue 17
I want to replace the NA within [["A"]] with another value using lapply.
If I try to assign the NA an arbitrary value (here I use 50) using either "<-" or "=", I get an error stating: could not find function "filter<-"
> lapply(list.tib, function(x) filter(x, is.na(Value))$Value <- 50)
Error in filter(x, is.na(Value))$Value <- 50 :
could not find function "filter<-"
I tried another approach using a different format for designating the values I wanted but ended up with a different type of error.
> lapply(list.tib, function(x) x[which(is.na(x$Value)),]$Value <- 50)
Error in `$<-.data.frame`(`*tmp*`, "Value", value = 50) :
replacement has 1 row, data has 0
I think this throws an error because [["B"]] does not have any NAs, so I'm trying to assign 50 to a selection with zero rows.
I would like a function that would provide the output of:
> list.tib
$`A`
# A tibble: 6 x 3
Group Color Value
<fct> <fct> <dbl>
1 A Red 5
2 A Red 50
3 A Red 6
4 A Blue 50
5 A Blue 16
6 A Blue 12
$B
# A tibble: 6 x 3
Group Color Value
<fct> <fct> <dbl>
1 B Red 4
2 B Red 5
3 B Red 6
4 B Blue 10
5 B Blue 12
6 B Blue 17
I am able to get this desired result if I do something like:
list.tib$A[which(is.na(list.tib$A$Value)),]$Value <- 50
But that is not generalizable. I think lapply() is the call for the job, but I can't get it to assign values to specific variables of an observation.
Thank you for the help!
If the Value column is present in all data.frames, then you can simply write the lapply call as:
lapply(split(tib, tib$Group), function(x){
x$Value[is.na(x$Value)]<-50
x
})
# $A
# # A tibble: 6 x 3
# Group Color Value
# <fctr> <fctr> <dbl>
# 1 A Red 5.00
# 2 A Red 50.0
# 3 A Red 6.00
# 4 A Blue 50.0
# 5 A Blue 16.0
# 6 A Blue 12.0
#
# $B
# # A tibble: 6 x 3
# Group Color Value
# <fctr> <fctr> <dbl>
# 1 B Red 4.00
# 2 B Red 5.00
# 3 B Red 6.00
# 4 B Blue 10.0
# 5 B Blue 12.0
# 6 B Blue 17.0
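Two details make this version work where the attempts in the question failed. First, the assignment targets x$Value[...] directly rather than the result of a function call, so R does not go looking for a replacement function such as filter<-. Second, indexing a vector with an all-FALSE logical and assigning to it is a no-op, so groups without NA pass through unchanged. A minimal sketch with made-up data follows; fill_na and tib_list are names introduced here for illustration.

```r
tib_list <- list(
  A = data.frame(Value = c(5, NA, 6)),
  B = data.frame(Value = c(4, 5, 6))  # no NA in this group
)

# Hypothetical helper: replace NA in the Value column with `fill`
fill_na <- function(x, fill = 50) {
  # When nothing is NA the logical index selects no elements,
  # so the assignment changes nothing and raises no error
  x$Value[is.na(x$Value)] <- fill
  x  # return the data frame itself, not the assigned value
}

filled <- lapply(tib_list, fill_na)
```

Returning x on the last line matters: without it, lapply() would collect the value 50 from each call instead of the modified data frames.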
We can use mutate and ifelse.
library(tidyverse)
lapply(list.tib, function(x) x %>% mutate(Value = ifelse(is.na(Value), 50, Value)))
Or replace_na from tidyr.
lapply(list.tib, function(x) x %>% replace_na(list(Value = 50)))
lapply(list.tib, function(x) x %>% mutate(Value = replace_na(Value, 50)))

Determining most/least amount of occurrences within subset row & column group in a data frame

I am trying to find the most and least amount of items within a row / column group in a larger data frame. Here is the data to make it clearer:
df <- data.frame(matrix(nrow = 8, ncol = 3))
df$X1 <- c(1, 1, 1, 2, 2, 3, 3, 3)
df$X2 <- c("yellow", "green", "yellow", "blue", NA, "orange", NA, "orange")
df$X3 <- c("green", "yellow", NA, "blue", "red", "purple" , "orange", NA)
names(df) <- c("group", "A", "B")
Here is what that looks like (I have NAs in the original data, so I've included them):
group A B
1 1 yellow green
2 1 green yellow
3 1 yellow <NA>
4 2 blue blue
5 2 <NA> red
6 3 orange purple
7 3 <NA> orange
8 3 orange <NA>
In the first "group", for instance, I want to determine which color occurs the most and which color occurs the least. Something that looks like this:
group A B most least
1 1 yellow green yellow green
2 1 green yellow yellow green
3 1 yellow <NA> yellow green
4 2 blue blue blue red
5 2 <NA> red blue red
6 3 orange purple orange purple
7 3 <NA> orange orange purple
8 3 orange <NA> orange purple
I am working within a dplyr chain in the original data so I can group_by "group", but I am having a hard time figuring out a method that allows me to work within a "cluster" of two columns with differing numbers of rows. I do not need this to be done with dplyr, but I figured it might be easiest given the usefulness of group_by. Additionally, I need the result to somehow remain in the original data frame as new columns. Any suggestions?
Here is a solution using dplyr and tidyr. The strategy is to find the "most" and "least" frequent item per group in a new data frame, then use right_join to merge it with the original data frame and produce the desired output.
Notice that during the process I used slice to subset the data frame to get the most and least item. This guarantees that there will be only one "most" and one "least" for each group. Nevertheless, it is possible that there could be a tie for each group. If that happens, you may want to think about what could be a good rule to determine which one is the "most" or which one is the "least".
library(dplyr)
library(tidyr)
df2 <- df %>%
gather(Column, Value, -group, na.rm = TRUE) %>%
count(group, Value) %>%
arrange(group, desc(n)) %>%
group_by(group) %>%
slice(c(1, n())) %>%
mutate(Type = c("most", "least")) %>%
select(-n) %>%
spread(Type, Value) %>%
right_join(df, by = "group") %>%
select(c(colnames(df), "most", "least"))
df2
# A tibble: 8 x 5
group A B most least
<dbl> <chr> <chr> <chr> <chr>
1 1 yellow green yellow green
2 1 green yellow yellow green
3 1 yellow <NA> yellow green
4 2 blue blue blue red
5 2 <NA> red blue red
6 3 orange purple orange purple
7 3 <NA> orange orange purple
8 3 orange <NA> orange purple
Two options:
Reshape to long form and use summarise (or count) to aggregate, subsetting with which.max/which.min:
library(tidyverse)
df <- tibble(group = c(1, 1, 1, 2, 2, 3, 3, 3),  # data_frame() is deprecated
A = c("yellow", "green", "yellow", "blue", NA, "orange", NA, "orange"),
B = c("green", "yellow", NA, "blue", "red", "purple" , "orange", NA))
df %>%
gather(var, color, A:B) %>%
drop_na(color) %>%
group_by(group, color) %>%
summarise(n = n()) %>%
summarise(most = color[which.max(n)],
least = color[which.min(n)]) %>%
left_join(df, .)
#> Joining, by = "group"
#> # A tibble: 8 x 5
#> group A B most least
#> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 yellow green yellow green
#> 2 1 green yellow yellow green
#> 3 1 yellow <NA> yellow green
#> 4 2 blue blue blue red
#> 5 2 <NA> red blue red
#> 6 3 orange purple orange purple
#> 7 3 <NA> orange orange purple
#> 8 3 orange <NA> orange purple
Sort a table of values and subset it:
df %>%
group_by(group) %>%
mutate(most = last(names(sort(table(c(A, B))))),
least = first(names(sort(table(c(A, B))))))
#> # A tibble: 8 x 5
#> # Groups: group [3]
#> group A B most least
#> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 yellow green yellow green
#> 2 1 green yellow yellow green
#> 3 1 yellow <NA> yellow green
#> 4 2 blue blue blue red
#> 5 2 <NA> red blue red
#> 6 3 orange purple orange purple
#> 7 3 <NA> orange orange purple
#> 8 3 orange <NA> orange purple
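As noted earlier, ties need a rule. The sketch below is a base R variant, not taken from any of the answers, that makes the rule explicit: table() drops NA by default and orders names alphabetically (locale-dependent), and which.max()/which.min() return the first extreme, so ties go to the alphabetically first color. most_least and lookup are names introduced for illustration.

```r
df <- data.frame(
  group = c(1, 1, 1, 2, 2, 3, 3, 3),
  A = c("yellow", "green", "yellow", "blue", NA, "orange", NA, "orange"),
  B = c("green", "yellow", NA, "blue", "red", "purple", "orange", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: most and least frequent value across two columns
most_least <- function(a, b) {
  counts <- table(c(a, b))  # NA values are not counted
  c(most = names(counts)[which.max(counts)],
    least = names(counts)[which.min(counts)])
}

# One row per group, then look the result up for every original row
lookup <- do.call(rbind, lapply(split(df, df$group),
                                function(d) most_least(d$A, d$B)))
df$most <- unname(lookup[as.character(df$group), "most"])
df$least <- unname(lookup[as.character(df$group), "least"])
```

To use a different tie rule, only most_least needs to change; the lookup step stays the same.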

r Group by and count

I am dealing with a dataset which is as follows
Id Date Color
10 2008-11-17 Red
10 2008-11-17 Red
10 2008-11-17 Blue
10 2010-01-26 Red
10 2010-01-26 Green
10 2010-01-26 Green
10 2010-01-26 Red
29 2007-07-31 Red
29 2007-07-31 Red
29 2007-07-31 Blue
29 2007-07-31 Green
29 2007-07-31 Red
My goal is to create a dataset like this
Color   Representation   Count           Min   Max
Red     1 + 1 + 1 = 3    2 + 2 + 3 = 7   2     3
Blue    1 + 1 = 2        1 + 1 = 2       1     1
Green   1 + 1 = 2        2 + 1 = 3       1     2
Representation
The value in the 1st row, 2nd column (Representation) is 3 because Red is represented three times based on the unique combinations of Id and Date. For example, the 1st and 2nd rows share the same Id (10) and Date (2008-11-17), so this combination is represented once (1(10, 2008-11-17)). The 4th and 7th rows share the same Id (10) and Date (2010-01-26), so this combination is also represented once (1(10, 2010-01-26)). The 8th, 9th, and 12th rows share the same Id (29) and Date (2007-07-31), which is likewise represented once (1(29, 2007-07-31)). Thus the value is 3 in row 1, column 2.
1(10, 2008-11-17) + 1(10, 2010-01-26) + 1(29, 2007-07-31) = 3
Count
The value in the 1st row, 3rd column (Count) is 7 because Red is mentioned twice by Id 10 on 2008-11-17 (2(10, 2008-11-17)), twice by Id 10 on 2010-01-26 (2(10, 2010-01-26)), and three times by Id 29 on 2007-07-31 (3(29, 2007-07-31)).
2(10, 2008-11-17) + 2(10, 2010-01-26) + 3(29, 2007-07-31) = 7
Any help on accomplishing this unique frequency/count table is much appreciated.
Dataset
Id = c(10,10,10,10,10,10,10,29,29,29,29,29)
Date = c("2008-11-17", "2008-11-17", "2008-11-17","2010-01-26","2010-01-26","2010-01-26","2010-01-26",
"2007-07-31","2007-07-31","2007-07-31","2007-07-31","2007-07-31")
Color = c("Red", "Red", "Blue", "Red", "Green", "Green", "Red", "Red", "Red", "Blue", "Green", "Red")
df = data.frame(Id, Date, Color)
With dplyr:
library(dplyr)
df %>% group_by(Color) %>%
summarize(Representation = n_distinct(Id, Date), Count = n())
# # A tibble: 3 × 3
# Color Representation Count
# <fctr> <int> <int>
# 1 Blue 2 2
# 2 Green 2 3
# 3 Red 3 7
Another option is data.table
library(data.table)
setDT(df)[, .(Representation = uniqueN(paste(Id, Date)), Count = .N) , by = Color]
# Color Representation Count
#1: Red 3 7
#2: Blue 2 2
#3: Green 2 3
Update
For the second question, we can try
library(matrixStats)
m1 <- sapply(split(df[["Color"]], list(df$Id, df$Date), drop = TRUE), function(x) table(x))
v1 <- (NA^!m1) * m1
df1 <- data.frame(Color = row.names(m1), Representation = rowSums(m1!=0),
Count = rowSums(m1), Min = rowMins(v1, na.rm=TRUE),
Max = rowMaxs(v1, na.rm=TRUE))
row.names(df1) <- NULL
df1
# Color Representation Count Min Max
#1 Blue 2 2 1 1
#2 Green 2 3 1 2
#3 Red 3 7 2 3
You can use the aggregate() function:
# Make a new column joining Date and Id (the key the counts are based on)
df$DateId <- paste(df$Date, df$Id)
# Get the representation values
Representation <- aggregate(DateId ~ Color, data=df,FUN=function(x){length(unique(x))})
Representation
#> Color DateId
#> 1 Blue 2
#> 2 Green 2
#> 3 Red 3
# Get the Count values
Count <- aggregate(DateId ~ Color, data=df,FUN=length)
Count
#> Color DateId
#> 1 Blue 2
#> 2 Green 3
#> 3 Red 7
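The aggregate() answer stops at Representation and Count. One way to get all four requested columns in base R, sketched here under the assumption that df is the dataset from the question, is to first count rows per (Id, Date, Color) and then summarise those per-visit counts by color. per_visit, n_by_color, and result are names introduced for illustration.

```r
Id <- c(10, 10, 10, 10, 10, 10, 10, 29, 29, 29, 29, 29)
Date <- c("2008-11-17", "2008-11-17", "2008-11-17", "2010-01-26", "2010-01-26",
          "2010-01-26", "2010-01-26", "2007-07-31", "2007-07-31", "2007-07-31",
          "2007-07-31", "2007-07-31")
Color <- c("Red", "Red", "Blue", "Red", "Green", "Green", "Red",
           "Red", "Red", "Blue", "Green", "Red")
df <- data.frame(Id, Date, Color)

# Step 1: how many times each color appears within each Id/Date combination
per_visit <- aggregate(n ~ Id + Date + Color, data = cbind(df, n = 1), FUN = sum)

# Step 2: summarise those per-combination counts for each color
n_by_color <- split(per_visit$n, per_visit$Color)
result <- data.frame(
  Color = names(n_by_color),
  Representation = sapply(n_by_color, length),  # number of Id/Date combinations
  Count = sapply(n_by_color, sum),              # total mentions
  Min = sapply(n_by_color, min),
  Max = sapply(n_by_color, max)
)
rownames(result) <- NULL
```

The rows come out in alphabetical order of Color (Blue, Green, Red), matching the order of the other outputs shown in this answer.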

Resources