Filter by value counts within groups

Filter by value counts within groups - r

I want to filter my grouped dataframe based on the number of occurrences of a specific value within a group.
Some exemplary data:
data <- data.frame(ID = sample(c("A","B","C","D"),100,replace = T),
rt = runif(100,0.2,1),
lapse = sample(1:2,100,replace = T))
The “lapse” column is my filter variable in this case.
I want to exclude every “ID” group that has more than 15 counts of “lapse” == 2 within!
data %>% group_by(ID) %>% count(lapse == 2)
So, if for example the group “A” has 17 times “lapse” == 2 within it should be filtered entirely from the datafame.

First I created some reproducible data using a set.seed and check the number of values per group. It seems that in this case only group D more values with lapse 2 has. You can use filter and sum the values with lapse 2 per group like this:
set.seed(7)
data <- data.frame(ID = sample(c("A","B","C","D"),100,replace = T),
rt = runif(100,0.2,1),
lapse = sample(1:2,100,replace = T))
library(dplyr)
# Check n values per group
data %>%
group_by(ID, lapse) %>%
summarise(n = n())
#> # A tibble: 8 × 3
#> # Groups: ID [4]
#> ID lapse n
#> <chr> <int> <int>
#> 1 A 1 8
#> 2 A 2 7
#> 3 B 1 13
#> 4 B 2 15
#> 5 C 1 18
#> 6 C 2 6
#> 7 D 1 17
#> 8 D 2 16
data %>%
group_by(ID) %>%
filter(!(sum(lapse == 2) > 15))
#> # A tibble: 67 × 3
#> # Groups: ID [3]
#> ID rt lapse
#> <chr> <dbl> <int>
#> 1 B 0.517 2
#> 2 C 0.589 1
#> 3 C 0.598 2
#> 4 C 0.715 1
#> 5 B 0.475 2
#> 6 C 0.965 1
#> 7 B 0.234 1
#> 8 B 0.812 2
#> 9 C 0.517 1
#> 10 B 0.700 1
#> # … with 57 more rows
Created on 2023-01-08 with reprex v2.0.2

Related

Difference by subgroup using R

I have the following dataset:
I want to calculate the difference between values according to the subgroups. Nevertheless, subgroup 1 must come first. Thus 10-0=10; 0-20=-20; 30-31=-1. I want to perform it using R.
I know that it would be something like this, but I do not know how to put the sub_group into the code:
library(tidyverse)
df %>%
group_by(group) %>%
summarise(difference= diff(value))

Edited answer after OP's comment:
The OP clarified that the data are not sorted by sub_group within every group. Therefore, I added the arrange after group_by. The OP further clarified that the value of sub_group == 1 always should be the first term of the difference.
Below I demonstrate how to achieve this in an example with 3 sub_groups within every group. The code rests on the assumption that the lowest value of sub_group == 1. I drop each group's first sub_group after the difference.
library(tidyverse)
df <- tibble(group = rep(LETTERS[1:3], each = 3),
sub_group = rep(1:3, 3),
value = c(10,0,5,0,20,15,30,31,10))
df
#> # A tibble: 9 × 3
#> group sub_group value
#> <chr> <int> <dbl>
#> 1 A 1 10
#> 2 A 2 0
#> 3 A 3 5
#> 4 B 1 0
#> 5 B 2 20
#> 6 B 3 15
#> 7 C 1 30
#> 8 C 2 31
#> 9 C 3 10
df |>
group_by(group) |>
arrange(group, sub_group) |>
mutate(value = first(value) - value) |>
slice(2:n())
#> # A tibble: 6 × 3
#> # Groups: group [3]
#> group sub_group value
#> <chr> <int> <dbl>
#> 1 A 2 10
#> 2 A 3 5
#> 3 B 2 -20
#> 4 B 3 -15
#> 5 C 2 -1
#> 6 C 3 20
Created on 2022-10-18 with reprex v2.0.2
P.S. (from the original answer)
In the example data, you show the wrong difference for group C. It should read -1. I am convinced that most people here would appreciate if you could post your example data using code or at least as text which can be copied instead of a picture.

How to Create Iterative Forumla to calculate Z Score in R?

I have a number of large data frames that have the following basic format, where the final two rows are a mean (d) and standard deviation (e) - although these are calculated elsewhere.
a b c
a 4 3 4
b 3 2 6
c 2 1 8
d 3 2 6
e 1 1 2
I would like to create an iterative function that converts each raw data point into a z-score via the mean and sd value in d and e per column. The formula I would like to apply is ((x-mean)/SD).
The result would be the following:
a b c
a 1 1 1
b 0 0 0
c -1 -1 -1
I don't mind if this is added to the end, created as a new dataframe or the data is converted.
Thanks!

Here is one approach, note that I do not use the mean/sd provided in the data but re-calculate it on the fly.
Also note that usually the data should be in a tidy data representation, which in your case would mean that a, b, c would be in columns and then mean/sd would be either calculated on the fly or be in a separate column (note that this would reshaping the data, not shown here).
# your input data
raw_data <- data.frame(
a = c(4, 3, 2, 3, 1),
b = c(3, 2, 1, 2, 1),
c = c(4, 6, 8, 6, 2),
row.names = c("a", "b", "c", "d", "e")
)
raw_data
#> a b c
#> a 4 3 4
#> b 3 2 6
#> c 2 1 8
#> d 3 2 6
#> e 1 1 2
# remove the mean/sd values
data <- raw_data[!rownames(raw_data) %in% c("d", "e"), ]
data
#> a b c
#> a 4 3 4
#> b 3 2 6
#> c 2 1 8
# quick way to recalculate the values
means <- apply(data, 2, mean)
means
#> a b c
#> 3 2 6
sds <- apply(data, 2, sd)
sds
#> a b c
#> 1 1 2
z_scores <- apply(data, 2, function(x) (x - mean(x)) / sd(x))
z_scores
#> a b c
#> a 1 1 -1
#> b 0 0 0
#> c -1 -1 1
Created on 2021-01-07 by the reprex package (v0.3.0)
Edit / Full Code
The following code is a bit longer but most of it is spent on getting the data into the right (long/tidy) format.
If you have any questions, feel free to use the comments.
Note that the tidyverse is really helpful, but might need some time to get used to. The code used here is mostly dplyr (included in the tidyverse).
If you understand the functions: %>% (pipe), group_by(), mutate(), summarise(), and pivot_longer/wider() you got everything.
library(tidyverse)
# use your original dataset again
raw_data <- data.frame(
a = c(4, 3, 2, 3, 1),
b = c(3, 2, 1, 2, 1),
c = c(4, 6, 8, 6, 2),
row.names = c("a", "b", "c", "d", "e")
)
### 1) Turn the data into a nicer format
# match-table how to rename the variables
var_match <- c(d = "mean", e = "sd")
# convert the raw data into a nicer format, first we do some minor changes
# (variable names, etc)
data_mixed <- raw_data %>%
# have the rownames as explicit variable
rownames_to_column("metric") %>%
# nicer printing etc
as_tibble() %>%
# replace variable names with mean/sd
mutate(metric = ifelse(metric %in% c("d", "e"),
var_match[metric], metric))
data_mixed
#> # A tibble: 5 x 4
#> metric a b c
#> <chr> <dbl> <dbl> <dbl>
#> 1 a 4 3 4
#> 2 b 3 2 6
#> 3 c 2 1 8
#> 4 mean 3 2 6
#> 5 sd 1 1 2
# separate the dataset into two:
# data holds the values
# data_vars holds the metrics mean and sd
data <- data_mixed %>% filter(!metric %in% var_match) %>% select(-metric)
data_vars <- data_mixed %>% filter(metric %in% var_match)
data
#> # A tibble: 3 x 3
#> a b c
#> <dbl> <dbl> <dbl>
#> 1 4 3 4
#> 2 3 2 6
#> 3 2 1 8
data_vars
#> # A tibble: 2 x 4
#> metric a b c
#> <chr> <dbl> <dbl> <dbl>
#> 1 mean 3 2 6
#> 2 sd 1 1 2
# turn the value dataset into its longer form, makes it easier to work with it later
data_long <- data %>%
pivot_longer(everything(), names_to = "var", values_to = "val")
data_long
#> # A tibble: 9 x 2
#> var val
#> <chr> <dbl>
#> 1 a 4
#> 2 b 3
#> 3 c 4
#> 4 a 3
#> 5 b 2
#> 6 c 6
#> 7 a 2
#> 8 b 1
#> 9 c 8
# turn the metric dataset into another long form, allowing easy combination in the next step
data_vars2 <- data_vars %>%
pivot_longer(-metric, names_to = "var", values_to = "val") %>%
pivot_wider(var, names_from = metric, values_from = val)
data_vars2
#> # A tibble: 3 x 3
#> var mean sd
#> <chr> <dbl> <dbl>
#> 1 a 3 1
#> 2 b 2 1
#> 3 c 6 2
# combine the datasets
data_all <- left_join(data_long, data_vars2, by = "var")
data_all
#> # A tibble: 9 x 4
#> var val mean sd
#> <chr> <dbl> <dbl> <dbl>
#> 1 a 4 3 1
#> 2 b 3 2 1
#> 3 c 4 6 2
#> 4 a 3 3 1
#> 5 b 2 2 1
#> 6 c 6 6 2
#> 7 a 2 3 1
#> 8 b 1 2 1
#> 9 c 8 6 2
## 2) calculate the z-score
# now comes the actual number crunchin!
# per variable var (a, b, c) compute the variable val_z as the z-score
data_res <- data_all %>%
group_by(var) %>%
mutate(val_z = (val - mean) / sd)
data_res
#> # A tibble: 9 x 5
#> # Groups: var [3]
#> var val mean sd val_z
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 a 4 3 1 1
#> 2 b 3 2 1 1
#> 3 c 4 6 2 -1
#> 4 a 3 3 1 0
#> 5 b 2 2 1 0
#> 6 c 6 6 2 0
#> 7 a 2 3 1 -1
#> 8 b 1 2 1 -1
#> 9 c 8 6 2 1
## 3) make the results more readable
# lastly pivot the results to its original form
data_res_wide <- data_res %>%
select(var, val_z) %>%
group_by(var) %>%
mutate(id = 1:n()) %>% # needed for easier identification of values
pivot_wider(id, names_from = var, values_from = val_z)
data_res_wide
#> # A tibble: 3 x 4
#> id a b c
#> <int> <dbl> <dbl> <dbl>
#> 1 1 1 1 -1
#> 2 2 0 0 0
#> 3 3 -1 -1 1
Created on 2021-01-07 by the reprex package (v0.3.0)

select only rows with duplicate id and specific value from another column in R

I have the following data with ID and value:
id <- c("1103-5","1103-5","1104-2","1104-2","1104-4","1104-4","1106-2","1106-2","1106-3","1106-3","2294-1","2294-1","2294-2","2294-2","2294-2","2294-3","2294-3","2294-3","2294-4","2294-4","2294-5","2294-5","2294-5","2300-1","2300-1","2300-2","2300-2","2300-4","2300-4","2321-1","2321-1","2321-2","2321-2","2321-3","2321-3","2321-4","2321-4","2347-1","2347-1","2347-2","2347-2")
value <- c(6,3,6,3,6,3,6,3,6,3,3,6,9,3,6,9,3,6,3,6,9,3,6,9,6,9,6,9,6,9,3,9,3,9,3,9,3,9,6,9,6)
If you notice, there are multiple values for the same id. What I'd like to do is get the value that are only 3 and 6 only if the IDs are the same. for eg. ID "1103-5" has both 3 and 6, so it should be in the list, but not "2347-2"
I'm using R
One method I tried is the following, but it gives me everything with value 3 and 6.
d <- data.frame(id, value)
group36 <- d[d$value == 3 | d$value == 6,]
and
d %>% group_by(id) %>% filter(3 == value | 6 == value)
The output should be like this:
id value
1103-5 6
1103-5 3
1104-2 6
1104-2 3
1104-4 6
1104-4 3
1106-2 6
1106-2 3
1106-3 6
1106-3 3
2294-1 3
2294-1 6
2294-2 3
2294-2 6
2294-3 3
2294-3 6
2294-4 3
2294-4 6
2294-5 3
2294-5 6

d<-group_by(d,id)
filter(d,any(value==3),any(value==6))
This gives you all the IDs where there is both a value of 3 (somewhere) AND a value of 6 (somewhere). Mind you, your data contains some IDs with THREE values. In these cases, if both 3 and 6 are present, it will be included in the result.
If you want to exclude those lines that remain which done equal 3 or 6, add this:
filter(d,value==3 | value==6)
If you want to exclude IDs that also have 3 and 6 as values but also have OTHER values, use this:
filter(d,any(value==3),any(value==6),value==3 | value==6)

Not sure if this is what you want. We can filter rows that equal to either 3 or 6 then convert from long to wide format and keep only columns which have both 3 and 6 values. After that, convert back to long format.
library(dplyr)
library(tidyr)
id <- c("1103-5","1103-5","1104-2","1104-2","1104-4","1104-4","1106-2","1106-2",
"1106-3","1106-3","2294-1","2294-1","2294-2","2294-2","2294-2",
"2294-3","2294-3","2294-3","2294-4","2294-4","2294-5","2294-5","2294-5",
"2300-1","2300-1","2300-2","2300-2","2300-4","2300-4","2321-1","2321-1",
"2321-2","2321-2","2321-3","2321-3","2321-4","2321-4","2347-1","2347-1","2347-2","2347-2")
value <- c(6,3,6,3,6,3,6,3,6,3,3,6,9,3,6,9,3,6,3,6,9,3,6,9,6,9,6,9,6,9,3,9,3,9,3,9,3,9,6,9,6)
d <- data.frame(id, value)
d %>%
group_by(id) %>%
filter(value %in% c(3, 6)) %>%
mutate(rows = 1:n()) %>%
spread(key = id, value) %>%
select_if(~ all(!is.na(.)))
#> # A tibble: 2 x 11
#> rows `1103-5` `1104-2` `1104-4` `1106-2` `1106-3` `2294-1` `2294-2`
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 6 6 6 6 6 3 3
#> 2 2 3 3 3 3 3 6 6
#> # ... with 3 more variables: `2294-3` <dbl>, `2294-4` <dbl>,
#> # `2294-5` <dbl>
d %>%
group_by(id) %>%
filter(value %in% c(3, 6)) %>%
mutate(rows = 1:n()) %>%
spread(key = id, value) %>%
select_if(~ all(!is.na(.))) %>%
select(-rows) %>%
gather(id, value)
#> # A tibble: 20 x 2
#> id value
#> <chr> <dbl>
#> 1 1103-5 6
#> 2 1103-5 3
#> 3 1104-2 6
#> 4 1104-2 3
#> 5 1104-4 6
#> 6 1104-4 3
#> 7 1106-2 6
#> 8 1106-2 3
#> 9 1106-3 6
#> 10 1106-3 3
#> 11 2294-1 3
#> 12 2294-1 6
#> 13 2294-2 3
#> 14 2294-2 6
#> 15 2294-3 3
#> 16 2294-3 6
#> 17 2294-4 3
#> 18 2294-4 6
#> 19 2294-5 3
#> 20 2294-5 6
Created on 2018-07-01 by the reprex package (v0.2.0.9000).

Average across rows, but leaving out own group [duplicate]

Using dplyr (preferably), I am trying to calculate the group mean for each observation while excluding that observation from the group.
It seems that this should be doable with a combination of rowwise() and group_by(), but both functions cannot be used simultaneously.
Given this data frame:
df <- data_frame(grouping = rep(LETTERS[1:5], 3),
value = 1:15) %>%
arrange(grouping)
df
#> Source: local data frame [15 x 2]
#>
#> grouping value
#> (chr) (int)
#> 1 A 1
#> 2 A 6
#> 3 A 11
#> 4 B 2
#> 5 B 7
#> 6 B 12
#> 7 C 3
#> 8 C 8
#> 9 C 13
#> 10 D 4
#> 11 D 9
#> 12 D 14
#> 13 E 5
#> 14 E 10
#> 15 E 15
I'd like to get the group mean for each observation with that observation excluded from the group, resulting in:
#> grouping value special_mean
#> (chr) (int)
#> 1 A 1 8.5 # i.e. (6 + 11) / 2
#> 2 A 6 6 # i.e. (1 + 11) / 2
#> 3 A 11 3.5 # i.e. (1 + 6) / 2
#> 4 B 2 9.5
#> 5 B 7 7
#> 6 B 12 4.5
#> 7 C 3 ...
I've attempted nesting rowwise() inside a function called by do(), but haven't gotten it to work, along these lines:
special_avg <- function(chunk) {
chunk %>%
rowwise() #%>%
# filter or something...?
}
df %>%
group_by(grouping) %>%
do(special_avg(.))

No need to define a custom function, instead we could simply sum all elements of the group, subtract the current value, and divide by number of elements per group minus 1.
df %>% group_by(grouping) %>%
mutate(special_mean = (sum(value) - value)/(n()-1))
# grouping value special_mean
# (chr) (int) (dbl)
#1 A 1 8.5
#2 A 6 6.0
#3 A 11 3.5
#4 B 2 9.5
#5 B 7 7.0

I came across this old question just by chance and I wondered if there is a general solution which would work for other aggregation functions besides mean() as well, e.g., max() as requested by jlesuffleur or median().
The idea is to omit the actual row from computing the aggregate by looping over the rows within the actual group:
library(dplyr)
df %>%
group_by(grouping) %>%
mutate(special_mean = sapply(1:n(), function(i) mean(value[-i])))
grouping value special_mean
<chr> <int> <dbl>
1 A 1 8.5
2 A 6 6
3 A 11 3.5
4 B 2 9.5
5 B 7 7
...
This will work for max() as well
df %>%
group_by(grouping) %>%
mutate(special_max = sapply(1:n(), \(i) max(value[-i])))
grouping value special_max
<chr> <int> <int>
1 A 1 11
2 A 6 11
3 A 11 6
4 B 2 12
5 B 7 12
6 B 12 7
...
For the sake of completeness, here is also a data.table solution:
library(data.table)
setDT(df)[, special_mean := sapply(1:.N, function(i) mean(value[-i])), by = grouping][]

Calculate group mean while excluding current observation using dplyr

Using dplyr (preferably), I am trying to calculate the group mean for each observation while excluding that observation from the group.
It seems that this should be doable with a combination of rowwise() and group_by(), but both functions cannot be used simultaneously.
Given this data frame:
df <- data_frame(grouping = rep(LETTERS[1:5], 3),
value = 1:15) %>%
arrange(grouping)
df
#> Source: local data frame [15 x 2]
#>
#> grouping value
#> (chr) (int)
#> 1 A 1
#> 2 A 6
#> 3 A 11
#> 4 B 2
#> 5 B 7
#> 6 B 12
#> 7 C 3
#> 8 C 8
#> 9 C 13
#> 10 D 4
#> 11 D 9
#> 12 D 14
#> 13 E 5
#> 14 E 10
#> 15 E 15
I'd like to get the group mean for each observation with that observation excluded from the group, resulting in:
#> grouping value special_mean
#> (chr) (int)
#> 1 A 1 8.5 # i.e. (6 + 11) / 2
#> 2 A 6 6 # i.e. (1 + 11) / 2
#> 3 A 11 3.5 # i.e. (1 + 6) / 2
#> 4 B 2 9.5
#> 5 B 7 7
#> 6 B 12 4.5
#> 7 C 3 ...
I've attempted nesting rowwise() inside a function called by do(), but haven't gotten it to work, along these lines:
special_avg <- function(chunk) {
chunk %>%
rowwise() #%>%
# filter or something...?
}
df %>%
group_by(grouping) %>%
do(special_avg(.))

No need to define a custom function, instead we could simply sum all elements of the group, subtract the current value, and divide by number of elements per group minus 1.
df %>% group_by(grouping) %>%
mutate(special_mean = (sum(value) - value)/(n()-1))
# grouping value special_mean
# (chr) (int) (dbl)
#1 A 1 8.5
#2 A 6 6.0
#3 A 11 3.5
#4 B 2 9.5
#5 B 7 7.0

I came across this old question just by chance and I wondered if there is a general solution which would work for other aggregation functions besides mean() as well, e.g., max() as requested by jlesuffleur or median().
The idea is to omit the actual row from computing the aggregate by looping over the rows within the actual group:
library(dplyr)
df %>%
group_by(grouping) %>%
mutate(special_mean = sapply(1:n(), function(i) mean(value[-i])))
grouping value special_mean
<chr> <int> <dbl>
1 A 1 8.5
2 A 6 6
3 A 11 3.5
4 B 2 9.5
5 B 7 7
...
This will work for max() as well
df %>%
group_by(grouping) %>%
mutate(special_max = sapply(1:n(), \(i) max(value[-i])))
grouping value special_max
<chr> <int> <int>
1 A 1 11
2 A 6 11
3 A 11 6
4 B 2 12
5 B 7 12
6 B 12 7
...
For the sake of completeness, here is also a data.table solution:
library(data.table)
setDT(df)[, special_mean := sapply(1:.N, function(i) mean(value[-i])), by = grouping][]

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Filter by value counts within groups - r

Related

Difference by subgroup using R

How to Create Iterative Forumla to calculate Z Score in R?

select only rows with duplicate id and specific value from another column in R

Average across rows, but leaving out own group [duplicate]

Calculate group mean while excluding current observation using dplyr

Categories

Resources