Select groups which have at least one of a certain value - r

How do I select groups based on a condition on the individual rows, say keep all groups that contain at least one (ANY) row with a certain value, e.g. 4 (or any other condition that is TRUE at least once)? Phrased the other way around: if a group does not have any rows where the condition is true, the entire group should be removed.
Let's take a very simple data set with two groups, where I want to select the group that has at least one row with a Value of 4 (i.e. group B here):
library(dplyr)
df <- data.frame(Group = LETTERS[c(1,1,2,2)], Value = 1:4)
df
# Group Value
# 1 A 1 # Group A has no values == 4 ~~> remove entire group
# 2 A 2
# 3 B 3
# 4 B 4 # Group B has at least one 4 ~~> keep the whole group
Doing group_by() and then filter (as in this post) will only select individual rows that contain a value of 4, not the whole group:
df %>%
  group_by(Group) %>%
  filter(Value == 4)
# Group Value
# <fctr> <int>
# 1 B 4

This turns out to be pretty easy: you just need to use the any() function in the filter call. Indeed, it appears that:
filter(any(...)) evaluates at the group_by() level: any() collapses the condition to a single TRUE/FALSE per group, which is recycled to every row of that group,
filter(...) evaluates at the rowwise() level, even when preceded by group_by().
Hence use:
df %>%
  group_by(Group) %>%
  filter(any(Value == 4))
#   Group Value
#  <fctr> <int>
# 1     B     3
# 2     B     4
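Since dplyr 1.1.0 the same idea can be written without an explicit group_by() by using the .by argument; a minimal sketch, assuming dplyr >= 1.1.0 is installed:

library(dplyr)  # >= 1.1.0 for the .by argument

# The condition is evaluated per Group; the result comes back ungrouped
df %>%
  filter(any(Value == 4), .by = Group)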
Interestingly, the same behavior appears with mutate; compare:
df %>%
  group_by(Group) %>%
  mutate(check1 = any(Value == 4),
         check2 = Value == 4)
Group Value check1 check2
<fctr> <int> <lgl> <lgl>
1 A 1 FALSE FALSE
2 A 2 FALSE FALSE
3 B 3 TRUE FALSE
4 B 4 TRUE TRUE
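If any() inside filter() feels too implicit, the mutate() version above can be turned into the same result step by step: compute the per-group flag, keep the flagged rows, then drop the helper column. A sketch using the same df:

df %>%
  group_by(Group) %>%
  mutate(keep = any(Value == 4)) %>%  # one TRUE/FALSE per group, recycled to each row
  filter(keep) %>%
  select(-keep) %>%
  ungroup()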

A data.table option is
library(data.table)
setDT(df)[, if (any(Value == 4)) .SD, by = Group]
#    Group Value
# 1:     B     3
# 2:     B     4
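An equivalent data.table idiom subsets .SD directly instead of returning it from if(); a sketch (df is already a data.table after the setDT() call above):

# A zero-row subset of .SD drops that group from the result
df[, .SD[any(Value == 4)], by = Group]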

In base R, without performing any grouping operation, we can do:
subset(df, Group %in% unique(Group[Value == 4]))
#   Group Value
# 3     B     3
# 4     B     4
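A grouped base R alternative is ave(), which evaluates a function per group and recycles the result back to every row; a minimal sketch (here both the input to and the output of ave() are logical):

# any(Value == 4) is computed per Group and recycled to each row
df[ave(df$Value == 4, df$Group, FUN = any), ]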

Related

count number of combinations by group

I am struggling to count the number of unique combinations in my data. I would like to first group the rows by id and then count how many times each combination of values occurs. Here it does not matter whether the elements are combined as d-f or f-d; they still belong in the same category, as they have the same elements:
combinations:
          n
c-f:      2  # also f-c
c-d-f:    1  # also c-f-d or f-d-c
d-f:      2  # also f-d. The dash is only for visualization purposes
Dummy example:
# my data
dd <- data.frame(id  = c(1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5),
                 cat = c('c', 'f', 'c', 'd', 'f', 'c', 'f', 'd', 'f', 'f', 'd'))
> dd
id cat
1 1 c
2 1 f
3 2 c
4 2 d
5 2 f
6 3 c
7 3 f
8 4 d
9 4 f
10 5 f
11 5 d
Using paste() is a great solution provided by @benson23, but it treats f-d and d-f as distinct categories. I would like the order not to matter. Thank you!
Create a "combination" column in summarise, we can count this column afterwards.
An easy way to count the category is to order them at the beginning, then in this case they will all be in the same order.
library(dplyr)
dd %>%
  group_by(id) %>%
  arrange(id, cat) %>%
  summarize(combination = paste0(cat, collapse = "-"), .groups = "drop") %>%
  count(combination)
# A tibble: 3 x 2
combination n
<chr> <int>
1 c-d-f 1
2 c-f 2
3 d-f 2
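An alternative that skips the arrange() step is to sort the categories inside summarize(): sort() puts each group's values into a canonical order before pasting, so d-f and f-d collapse to the same label. A sketch with the same dd:

library(dplyr)

dd %>%
  group_by(id) %>%
  summarize(combination = paste(sort(cat), collapse = "-"), .groups = "drop") %>%
  count(combination)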

R dplyr: filter common values by group

I need to find the values common to all groups, ideally using dplyr in R.
From my dataset here:
group val
<fct> <dbl>
1 a 1
2 a 2
3 a 3
4 b 3
5 b 4
6 b 5
7 c 1
8 c 3
the expected output is
group val
<fct> <dbl>
1 a 3
2 b 3
3 c 3
as only number 3 occurs in all groups.
This code does not work:
# Filter the data
dd %>%
  group_by(group) %>%
  filter(all(val)) # does not work
The example here solves a similar issue but has a predefined vector of shared values. What if I do not know which values are shared?
Dummy example:
# Reproducible example: filter all id by group
group = c("a", "a", "a",
"b", "b", "b",
"c", "c")
val = c(1,2,3,
3,4,5,
1,3)
dd <- data.frame(group,
val)
group_by() isolates each group, so we can't very well group_by(group) and compare between groups. Instead, we can group_by(val) and see which values have all the groups:
dd %>%
  group_by(val) %>%
  filter(n_distinct(group) == n_distinct(dd$group))
# # A tibble: 3 x 2
# # Groups: val [1]
# group val
# <chr> <dbl>
# 1 a 3
# 2 b 3
# 3 c 3
This is one of the rare cases where we want to use data$column in a dplyr verb: n_distinct(dd$group) refers explicitly to the ungrouped original data to get the total number of groups. (It could also be pre-computed, as shown below.) By contrast, n_distinct(group) uses the grouped data piped into filter(), so it gives the number of distinct groups for each value (because we group_by(val)).
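Pre-computing the total also avoids reaching back into dd from inside the pipeline; a minimal sketch:

n_groups <- n_distinct(dd$group)  # total number of groups, computed once

dd %>%
  group_by(val) %>%
  filter(n_distinct(group) == n_groups) %>%
  ungroup()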
A base R approach can be:
newd <- dd[dd$val %in% Reduce(intersect, split(dd$val, dd$group)), ]
Output:
group val
3 a 3
4 b 3
8 c 3
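To see how this line works, it helps to inspect the intermediate pieces: split() produces one vector of values per group, and Reduce() folds intersect() over that list; a sketch:

# One vector of values per group: $a 1 2 3, $b 3 4 5, $c 1 3
split(dd$val, dd$group)

# Fold intersect() over the list: values present in every group
Reduce(intersect, split(dd$val, dd$group))
# [1] 3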
A similar option in data.table to that of @GregorThomas's solution is
library(data.table)
setDT(dd)[dd[, .I[uniqueN(group) == uniqueN(dd$group)], val]$V1]

Mutate by group based on a conditional

I am trying to add a summary column to a dataframe. Although the summary statistic should be attached to every row, the statistic itself should only be calculated from the rows meeting a condition.
As an example, given this dataframe:
x <- data.frame(usernum = rep(c(1, 2, 3, 4), each = 3),
                final   = rep(c(TRUE, TRUE, FALSE, FALSE)),
                time    = 1:12)
I would like to add a user.mean column, but where the mean is only calculated over rows where final is TRUE. I have tried:
library(tidyverse)
x %>%
  group_by(usernum) %>%
  mutate(user.mean = mean(x$time[x$final == TRUE]))
but this gives an overall mean, rather than by user. I have also tried:
x %>%
  group_by(usernum) %>%
  filter(final == TRUE) %>%
  mutate(user.mean = mean(time))
but this only returns the filtered dataframe:
# A tibble: 6 x 4
# Groups: usernum [4]
usernum final time user.mean
<dbl> <lgl> <int> <dbl>
1 1 TRUE 1 1.5
2 1 TRUE 2 1.5
3 2 TRUE 5 5.5
4 2 TRUE 6 5.5
5 3 TRUE 9 9
6 4 TRUE 10 10
How can I apply those means to every original row?
If we use x$ after the group_by, it returns the entire column instead of only the values in that particular group. Second, final is already a logical (TRUE/FALSE) vector, so we don't need == TRUE:
library(dplyr)
x %>%
  group_by(usernum) %>%
  mutate(user.mean = mean(time[final]))
The one place where we can safely use $ is with the .data pronoun:
x %>%
  group_by(usernum) %>%
  mutate(user.mean = mean(.data$time[.data$final]))
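For completeness, the same grouped conditional mean can be written in data.table, which adds the column by reference; a sketch, assuming the same x as above:

library(data.table)

# time[final] selects the final == TRUE rows within each usernum group
setDT(x)[, user.mean := mean(time[final]), by = usernum]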

determine duplicate rows whose at least one row has different value in a column [duplicate]

I have data with a grouping variable ("from") and values ("number"):
from number
1 1
1 1
2 1
2 2
3 2
3 2
I want to subset the data and select groups which have two or more unique values. In my data, only group 2 has more than one distinct 'number', so this is the desired result:
from number
2 1
2 2
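For reference, the data shown above can be reconstructed with (a sketch):

df <- data.frame(from   = c(1, 1, 2, 2, 3, 3),
                 number = c(1, 1, 1, 2, 2, 2))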
Several possibilities; here's my favorite:
library(data.table)
setDT(df)[, if(+var(number)) .SD, by = from]
# from number
# 1: 2 1
# 2: 2 2
Basically, for each group we check whether there is any variance; if TRUE, we return the group's rows.
With base R, I would go with
df[as.logical(with(df, ave(number, from, FUN = var))), ]
# from number
# 3 2 1
# 4 2 2
Edit: for non-numerical data you could try the new uniqueN() function from the development version of data.table (or use length(unique(number)) > 1 instead):
setDT(df)[, if(uniqueN(number) > 1) .SD, by = from]
You could try
library(dplyr)
df %>%
  group_by(from) %>%
  filter(n_distinct(number) > 1)
# from number
#1 2 1
#2 2 2
Or using base R
indx <- rowSums(!!table(df)) > 1
subset(df, from %in% names(indx)[indx])
# from number
#3 2 1
#4 2 2
Or
df[with(df, !ave(number, from, FUN = anyDuplicated)), ]
# from number
#3 2 1
#4 2 2
Using the variance concept shared by David, but doing it the dplyr way:
library(dplyr)
df %>%
  group_by(from) %>%
  mutate(variance = var(number)) %>%
  filter(variance != 0) %>%
  select(from, number)
#Source: local data frame [2 x 2]
#Groups: from
#from number
#1 2 1
#2 2 2
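With dplyr >= 1.1.0 the n_distinct() answer also fits on one line via the .by argument; a minimal sketch:

library(dplyr)  # >= 1.1.0

df %>%
  filter(n_distinct(number) > 1, .by = from)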

Filter rows by last maximal value ordering by a time value

I have a dataframe with an id, an ordering time value, and a value. Within each group of ids, I would like to remove the rows whose value does not exceed the value of every earlier row (i.e. every row with a smaller time).
data <- data.frame(id    = c(rep(c("a", "b"), each = 3L), "b"),
                   time  = c(0, 1, 2, 0, 1, 2, 3),
                   value = c(1, 1, 2, 3, 1, 2, 4))
> data
id time value
1 a 0 1
2 a 1 1
3 a 2 2
4 b 0 3
5 b 1 1
6 b 2 2
7 b 3 4
So the result would be:
> data
id time value
1 a 0 1
2 a 2 2
3 b 0 3
4 b 3 4
(For id == "b", the rows where time %in% c(1, 2) are removed because their value is smaller than the value at an earlier time.)
I was thinking about lag:
data %>%
  group_by(id) %>%
  filter(time == 0 | lag(value, order_by = time) < value)
Source: local data frame [5 x 3]
Groups: id [2]
id time value
<fctr> <dbl> <dbl>
1 a 0 1
2 a 2 2
3 b 0 3
4 b 2 2
5 b 3 4
But it doesn't work as expected, since lag() is a vectorized function that only compares each row with the immediately preceding one, not with the running maximum so far. The idea would instead be a "recursive lag", i.e. a check against the last maximal value. I can do it recursively with a loop, but I'm sure there is a more straightforward and higher-level way to do it.
Any help would be appreciated, thank you !
Here is a data.table solution:
library(data.table)
setDT(data)
data[, myVal := cummax(c(0, shift(value)[-1])), by=id][value > myVal][, myVal := NULL][]
id time value
1: a 0 1
2: a 2 2
3: b 0 3
4: b 3 4
The first part of the chain uses shift and cummax to create the cumulative maximum of the lagged value variable. In c(0, shift(value)[-1]), 0 is added to supply a value lower than any in the variable; more generally, you could use min(value) - 1. The [-1] subsetting removes the first element of shift, which is NA. The second part of the chain selects observations where value is greater than the cumulative maximum. The final two chains remove the cumulative maximum variable and print out the result.
Another option is to perform a self anti/non-equi join using data.table
library(data.table) # v1.10.0
setDT(data)[!data, on = .(id, time > time, value <= value)]
# id time value
# 1: a 0 1
# 2: a 2 2
# 3: b 0 3
# 4: b 3 4
Which is basically saying: "if time is larger but value is less than or equal, then I don't want these rows" (the ! sign negates the join, turning it into an anti-join).
Here is an option with dplyr. After grouping by 'id', we filter the rows where the 'value' is greater than the cumulative maximum of the lag of the 'value' column:
library(dplyr)
data %>%
  group_by(id) %>%
  filter(value > cummax(lag(value, default = 0)))
# id time value
# <fctr> <dbl> <dbl>
#1 a 0 1
#2 a 2 2
#3 b 0 3
#4 b 3 4
Another option is slice() after arranging by 'id' and 'time' (as the OP mentioned, the order matters):
data %>%
  group_by(id) %>%
  arrange(id, time) %>%
  slice(which(value > cummax(lag(value, default = 0))))
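The same running-maximum idea also works in base R with ave(): the helper lags each group's values and takes the cumulative maximum, and only rows exceeding it survive. A sketch, assuming the data are already sorted by time within each id (as they are here):

# Cumulative maximum of all earlier values within each id; -Inf for the first row
lagged_max <- with(data, ave(value, id, FUN = function(v) cummax(c(-Inf, v[-length(v)]))))
data[data$value > lagged_max, ]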
