Three doctors diagnose each patient.
Question 1: how do I keep the patients that all three doctors diagnosed with disease B (regardless of whether it is B.1, B.2 or B.3)?
Question 2: how do I keep the patients that any of the three doctors diagnosed with disease A?
set.seed(20200107)
df <- data.frame(id = rep(1:5,each =3),
disease = sample(c('A','B'), 15, replace = T))
df$disease <- as.character(df$disease)
df[1,2] <- 'A'
df[4,2] <- 'B.1'
df[5,2] <- 'B.2'
df[6,2] <- 'B.3'
df
I have an approach in mind, but I don't know how to write the code. I think the any() or all() function should be used.
First, I want to group patients by id.
Second, check whether all the diseases in each group are A or B.
Something like this:
df %>% group_by(id) %>% filter_all(all_vars(disease == B))
You can use all assuming every patient is checked by 3 doctors only.
library(dplyr)
df %>% group_by(id) %>% summarise(disease_B = all(grepl('B', disease)))
# id disease_B
# <int> <lgl>
#1 1 FALSE
#2 2 TRUE
#3 3 FALSE
#4 4 FALSE
#5 5 FALSE
If you want to subset the rows of the patient, we can use filter
df %>% group_by(id) %>% filter(all(grepl('B', disease)))
For question 2, we can similarly use any:
df %>% group_by(id) %>% summarise(disease_A = any(grepl('A', disease)))
data
df <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L, 5L, 5L, 5L), disease = c("A", "A", "A", "B.1", "B.2",
"B.3", "B", "A", "A", "B", "A", "A", "B", "A", "B")), row.names = c(NA,
-15L), class = "data.frame")
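As a base R counterpart to the dplyr pipelines above, the same all()/any() checks can be sketched with tapply(). This is only an illustration under two assumptions that are not in the original answer: startsWith() is used instead of grepl('B', ...) so that only codes beginning with "B" count, and the helper names (all_b, any_a, ids_all_b, ids_any_a) are made up for the example.

```r
# Data copied from the answer above: three diagnoses per patient id.
df <- data.frame(
  id = rep(1:5, each = 3),
  disease = c("A", "A", "A", "B.1", "B.2", "B.3", "B", "A", "A",
              "B", "A", "A", "B", "A", "B"),
  stringsAsFactors = FALSE
)

# Question 1: TRUE for ids where every diagnosis starts with "B".
all_b <- tapply(startsWith(df$disease, "B"), df$id, all)
# Question 2: TRUE for ids where at least one diagnosis is exactly "A".
any_a <- tapply(df$disease == "A", df$id, any)

# Hypothetical helper vectors holding the matching patient ids.
ids_all_b <- as.integer(names(all_b)[all_b])
ids_any_a <- as.integer(names(any_a)[any_a])

# Keep the full rows of the matching patients.
rows_all_b <- df[df$id %in% ids_all_b, ]
rows_any_a <- df[df$id %in% ids_any_a, ]
```

With this data, only patient 2 satisfies question 1, while patients 1, 3, 4 and 5 satisfy question 2.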
For question 1, you can replace B.1, B.2, ... with B, then count the occurrences of each "Disease" per patient and keep only the rows where the count is 3 and the disease is B:
library(tidyverse)
df %>% group_by(id) %>%
mutate(Disease = gsub("\\.[0-9]+","",disease)) %>%  # escape the dot so it matches a literal "."
count(Disease) %>%
filter(n == 3 & Disease == "B")
# A tibble: 2 x 3
# Groups: id [2]
id Disease n
<int> <chr> <int>
1 2 B 3
2 4 B 3
For question 2, similarly, you can replace B.1, ... with B, keep only the rows where Disease is A, then count the rows per patient. The output is the patient id and the number of doctors who diagnosed disease A:
df %>% group_by(id) %>%
mutate(Disease = gsub("\\.[0-9]+","",disease)) %>%  # escape the dot so it matches a literal "."
filter(Disease == "A") %>%
count(id)
# A tibble: 3 x 2
# Groups: id [3]
id n
<int> <int>
1 1 1
2 3 3
3 5 2
I was trying to do something kind of simple. My dataframe looks like this:
ID value
1 a
2 b
2 c
3 d
3 d
4 e
4 e
4 e
What I wanted to do is to filter groups with more than one row and where all the values in the value column are the same:
df %>% group_by(ID) %>% filter(n() > 1 & all(mysterious_condition))
So mysterious_condition is what I'm lacking. What I'm trying to achieve is this:
ID value
3 d
3 d
4 e
4 e
4 e
Any thoughts on how to accomplish this?
Thanks!
We may use n_distinct to check for the count of unique elements
library(dplyr)
df %>%
group_by(ID) %>%
filter(n() >1, n_distinct(value) == 1) %>%
ungroup
-output
# A tibble: 5 × 2
ID value
<int> <chr>
1 3 d
2 3 d
3 4 e
4 4 e
5 4 e
data
df <- structure(list(ID = c(1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L), value = c("a",
"b", "c", "d", "d", "e", "e", "e")), class = "data.frame", row.names = c(NA,
-8L))
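For readers without dplyr, the two conditions (more than one row, exactly one distinct value) can be sketched in base R with ave(), which returns a per-row vector of group summaries. This is a rough equivalent rather than the answer's own method; group_size and n_values are names invented for the illustration.

```r
# Same data as in the answer above.
df <- data.frame(
  ID = c(1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L),
  value = c("a", "b", "c", "d", "d", "e", "e", "e"),
  stringsAsFactors = FALSE
)

# Number of rows in each row's group (plays the role of n()).
group_size <- ave(seq_along(df$ID), df$ID, FUN = length)

# Number of distinct values in each row's group (plays the role of n_distinct()).
# The values are encoded as integers so that ave() returns an integer vector.
n_values <- ave(as.integer(factor(df$value)), df$ID,
                FUN = function(x) length(unique(x)))

# Keep groups with more than one row whose values are all identical.
result <- df[group_size > 1 & n_values == 1, ]
```

Here result holds the two rows of ID 3 and the three rows of ID 4.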
I have a dataset that has two columns. One column indicates the group and each group has only two rows. The second column represents the category. Now I would like to count the percentage of each group not having the same category. So in row 1 and 2, the Category is not the same while in row 3 and 4 it is the same. In the provided data, I would get a percentage of 66.66% as four times the Category changes while it stays the same for two groups.
This is my data:
structure(list(Group = c("A", "A", "B", "B", "C", "C", "D", "D",
"E", "E", "F", "F"), Category = c(1L, 2L, 3L, 3L, 5L, 6L, 7L,
7L, 7L, 6L, 5L, 4L)), class = "data.frame", row.names = c(NA,
-12L))
I have tried the following so far:
Data <- Data %>%
group_by(Group) %>%
count(n())
But I don't know how to write the code in the last line to get my desired percentage. Could someone help me here?
A base solution with tapply():
mean(with(df, tapply(Category, Group, \(x) length(unique(x)))) > 1)
# [1] 0.6666667
With dplyr, you could use n_distinct() to count the number of unique values.
library(dplyr)
df %>%
group_by(Group) %>%
summarise(N = n_distinct(Category)) %>%
summarise(Percent = mean(N > 1))
# # A tibble: 1 × 1
# Percent
# <dbl>
# 1 0.667
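The one-liners above can be unpacked into intermediate steps to make the arithmetic visible. This is a sketch that assumes the question's data is stored in an object called Data (the question's own name for it); n_categories, differs and share_differs are illustrative names only.

```r
# Data from the question: six groups with two rows each.
Data <- data.frame(
  Group = rep(c("A", "B", "C", "D", "E", "F"), each = 2),
  Category = c(1L, 2L, 3L, 3L, 5L, 6L, 7L, 7L, 7L, 6L, 5L, 4L),
  stringsAsFactors = FALSE
)

# Step 1: number of distinct categories within each group.
n_categories <- tapply(Data$Category, Data$Group,
                       function(x) length(unique(x)))

# Step 2: a group "differs" when it holds more than one distinct category.
differs <- n_categories > 1

# Step 3: the mean of a logical vector is the share of TRUE values,
# here 4 differing groups out of 6.
share_differs <- mean(differs)
```

Groups B and D are the two whose category stays the same, which leaves four of six groups differing.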
To show it for both classes, you can use the following code:
library(dplyr)
Data %>%
group_by(Group) %>%
mutate(unique = as.numeric(n_distinct(Category) == 1)) %>%
ungroup() %>%
summarise(Percent = prop.table(table(unique)))
Output:
# A tibble: 2 × 1
Percent
<table>
1 0.6666667
2 0.3333333
Using base R
counts <- table(df)
prop.table(table(rowSums(counts != 0)))
-output
1 2
0.3333333 0.6666667
It's a bit tricky to explain, but I'll try my best. I have a df as below. Within each group, I need to keep the row with the maximum pop, but only among countries that have not already been selected in an earlier group. (In the expected output (image), A does not appear in Group 2 because it had already occurred in Group 1.)
In short, I need unique values in the country column while also getting the maximum pop at the group level. I hope the picture conveys what I could not. (Tidyverse solution preferred.)
[Image: expected output]
df<- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), country = c("A", "B", "C", "A", "E", "F", "A", "E", "G"), pop = c(200L, 100L, 50L, 200L, 150L, 120L, 200L, 150L,
100L)), class = "data.frame", row.names = c(NA, -9L))
I think this will do. Explanation of the syntax:
Split the data into a list with one tibble per group.
Set the first group aside: after filtering for its maximum pop, it serves as .init in the next step.
Use purrr::reduce to collapse the remaining list of tibbles into a single tibble.
In each iteration of reduce:
  .init (the filtered first group) is the starting value;
  countries already selected in previous groups are removed with anti_join;
  the remaining rows are filtered for the maximum pop again;
  the previously selected rows are added back with bind_rows().
Thus, in the end we have the desired tibble.
library(dplyr)
library(purrr)  # reduce() comes from purrr
df %>% group_split(Group) %>% .[-1] %>%
  reduce(.init = df %>% group_split(Group) %>% .[[1]] %>%
           filter(pop == max(pop)),
         ~ .y %>%
           anti_join(.x, by = "country") %>%  # drop countries already selected
           filter(pop == max(pop)) %>%
           bind_rows(.x) %>% arrange(Group))
# A tibble: 3 x 3
Group country pop
<int> <chr> <int>
1 1 A 200
2 2 E 150
3 3 G 100
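The same sequential logic can be sketched as a plain base R loop, which may be easier to follow than the reduce() call even though the question asked for a tidyverse solution. The names seen, picks, cand and best are made up for this illustration, and the sketch assumes groups are processed in order of first appearance.

```r
# Data from the question.
df <- data.frame(
  Group = rep(1:3, each = 3),
  country = c("A", "B", "C", "A", "E", "F", "A", "E", "G"),
  pop = c(200L, 100L, 50L, 200L, 150L, 120L, 200L, 150L, 100L),
  stringsAsFactors = FALSE
)

seen <- character(0)  # countries already selected in earlier groups
picks <- list()       # one data frame of selected rows per group

for (g in unique(df$Group)) {
  # Candidates: rows of this group whose country was not selected before.
  cand <- df[df$Group == g & !(df$country %in% seen), ]
  if (nrow(cand) == 0) next  # nothing left to select in this group

  # Keep the candidate row(s) with the maximum pop.
  best <- cand[cand$pop == max(cand$pop), ]
  picks[[length(picks) + 1]] <- best
  seen <- c(seen, best$country)
}

result <- do.call(rbind, picks)
```

On the question's data this selects A (200), E (150) and G (100), matching the output shown above.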
You can create a helper function that writes the maximum pop from each group in a vector and use it to filter the dataframe.
library(tidyverse)
max_values <- c()
helper <- function(dat, ...){
dat <- dat[!(dat %in% max_values)] # exclude maximum values from previous groups
max_value <- max(dat) # get current max. value
max_values <<- c(max_values, max_value) # append
return(max_value)
}
df %>%
group_by(Group) %>%
filter(pop == helper(pop))
which gives you:
# A tibble: 3 x 3
# Groups: Group [3]
Group country pop
<int> <chr> <int>
1 1 A 200
2 2 E 150
3 3 H 120
Data used:
> df
Group country pop
1 1 A 200
2 1 B 100
3 1 C 50
4 2 A 200
5 2 E 150
6 2 F 120
7 3 A 200
8 3 E 150
9 3 G 100
10 3 H 120
Here is another possibility, but it is overly simplified: it does not take into account the possibility of a country having a higher population in a Group where it does not win.
library(dplyr)
df<- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), country = c("A", "B", "C", "A", "E", "F", "A", "E", "G"), pop = c(200L, 100L, 50L, 200L, 150L, 120L, 200L, 150L,
100L)), class = "data.frame", row.names = c(NA, -9L))
df %>%
group_by(country) %>%
summarize(popmax = max(pop)) %>%
inner_join(df, by = c("popmax" = 'pop')) %>%
rename(country = country.y) %>%
select(-country.x) %>%
group_by(country) %>%
arrange(Group) %>%
slice(1) %>%
ungroup() %>%
group_by(Group) %>%
arrange(country) %>%
slice(1) %>%
select(Group, country, popmax) %>%
rename(pop = popmax)
My answer fails (while other answers don't) with this data set:
df <- tribble(
~Group, ~ country, ~pop,
1 , 'A', 200,
1 , 'B', 100,
1 , 'C', 50,
1 , 'G', 150,
2 , 'A', 200,
2 , 'E', 150,
2 , 'F', 120,
3 , 'A', 200,
3 , 'E', 150,
3 , 'G', 100
)
Update
@Crestor who is claiming that my answer is not correct.
My answer is correct because my code gives the desired output as requested by OP.
Your objection that my code does not work on another scenario may be correct, but in this setting it is irrelevant, as my answer was only intended to solve the task at hand.
Here is the answer to your raised scenario with this dataset:
df1 <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
country = c("A", "B", "C", "A", "E", "F", "A", "E", "G"),
pop = c(200L, 100L, 250L, 220L, 150L, 120L, 200L, 150L, 100L
)), row.names = c(NA, -9L), class = c("tbl_df", "tbl", "data.frame"
))
expected output by Crestor:
# A tibble: 3 x 3
Group country pop
<int> <chr> <int>
1 1 C 250
2 2 A 220
3 3 E 150
My code for your scenario @crestor
library(dplyr)
df1 %>%
group_by(country) %>%
arrange(Group) %>%
filter(pop == max(pop)) %>%
group_by(Group) %>%
filter(pop == max(pop))
Output:
Group country pop
<int> <chr> <int>
1 1 C 250
2 2 A 220
3 3 E 150
Original answer to the question by OP
To keep it simple: first arrange the dataset by country and pop. Then group_by country and keep the first row of each group with slice. Finally, group_by Group and filter for the maximum pop.
library(dplyr)
df %>%
arrange(country, pop) %>%
group_by(country) %>%
slice(1) %>%
group_by(Group) %>%
filter(pop==max(pop))
Output:
Group country pop
<int> <chr> <int>
1 1 A 200
2 2 E 150
3 3 G 100
I understand we can use the dplyr function coalesce() to unite different columns, but is there such a function to unite rows?
I am struggling with a confusing incomplete/doubled dataframe with duplicate rows for the same id, but with different columns filled. E.g.
id sex age source
12 M NA 1
12 NA 3 1
13 NA 2 2
13 NA NA NA
13 F 2 NA
and I am trying to achieve:
id sex age source
12 M 3 1
13 F 2 2
You can try:
library(dplyr)
library(tidyr)  # fill() comes from tidyr
#Data
df <- structure(list(id = c(12L, 12L, 13L, 13L, 13L), sex = structure(c(2L,
NA, NA, NA, 1L), .Label = c("F", "M"), class = "factor"), age = c(NA,
3L, 2L, NA, 2L), source = c(1L, 1L, 2L, NA, NA)), class = "data.frame", row.names = c(NA,
-5L))
df %>%
group_by(id) %>%
fill(everything(), .direction = "down") %>%
fill(everything(), .direction = "up") %>%
slice(1)
# A tibble: 2 x 4
# Groups: id [2]
id sex age source
<int> <fct> <int> <int>
1 12 M 3 1
2 13 F 2 2
As mentioned by @A5C1D2H2I1M1N2O1R2T1 you can select the first non-NA value in each group. This can be done using dplyr :
library(dplyr)
df %>% group_by(id) %>% summarise(across(everything(), ~ na.omit(.x)[1]))
# A tibble: 2 x 4
# id sex age source
# <int> <fct> <int> <int>
#1 12 M 3 1
#2 13 F 2 2
Base R :
aggregate(.~id, df, function(x) na.omit(x)[1], na.action = 'na.pass')
Or data.table :
library(data.table)
setDT(df)[, lapply(.SD, function(x) na.omit(x)[1]), id]
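The idea shared by these variants, taking the first non-NA value of each column within each id, can be spelled out step by step. The sketch below is only an illustration: first_non_na is a hypothetical helper, and sex is stored as a character column rather than the factor used in the answer's data.

```r
# Data from the question, with sex as character instead of factor.
df <- data.frame(
  id = c(12L, 12L, 13L, 13L, 13L),
  sex = c("M", NA, NA, NA, "F"),
  age = c(NA, 3L, 2L, NA, 2L),
  source = c(1L, 1L, 2L, NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first non-NA element, or NA if there is none.
first_non_na <- function(x) x[!is.na(x)][1]

# Split by id, collapse every column of each piece to one value,
# then stack the one-row data frames back together.
collapsed <- do.call(
  rbind,
  lapply(split(df, df$id), function(g) {
    as.data.frame(lapply(g, first_non_na), stringsAsFactors = FALSE)
  })
)
```

A column that is NA throughout a group simply stays NA in the collapsed row, because indexing an empty vector with [1] returns NA.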
I have a dataset that has two columns. One is userid, the other is company type, like below:
userid company.type
1 A
2 A
3 C
1 B
2 B
3 B
4 A
I want to know how many unique userid's there are that have company.type of A and B or A and C, (but not B and C).
I'm assuming it's some sort of aggregate function, but I'm not sure how to place the qualifier that company.type has to be A and B or A and C only.
We can do this with base R using table
tbl <- table(df1) > 0
sum(((tbl[, 1] & tbl[,2]) | (tbl[,1] & tbl[,3])) & (!(tbl[,2] & tbl[,3])))
#[1] 2
Here's an idea with dplyr. setequal checks if two vectors are composed of the same elements, regardless of ordering:
library(dplyr)
df %>%
group_by(userid) %>%
summarize(temp = setequal(company.type, c("A", "B")) |
setequal(company.type, c("A", "C"))) %>%
pull(temp) %>%
sum()
# [1] 2
Data:
df <- structure(list(userid = c(1L, 2L, 3L, 1L, 2L, 3L, 4L), company.type = c("A",
"A", "C", "B", "B", "B", "A")), .Names = c("userid", "company.type"
), class = "data.frame", row.names = c(NA, -7L))
See: Check whether two vectors contain the same (unordered) elements in R
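To see why setequal() fits here, note that it compares vectors as sets, ignoring both order and duplicates. A small base R sketch on the same data follows; types and ok are illustrative names, not part of the answer.

```r
# setequal() ignores order and repeated elements.
same_set  <- setequal(c("B", "A", "A"), c("A", "B"))  # TRUE
not_equal <- setequal(c("A"), c("A", "B"))            # FALSE: "B" is missing

# Data from the answer above.
df <- data.frame(
  userid = c(1L, 2L, 3L, 1L, 2L, 3L, 4L),
  company.type = c("A", "A", "C", "B", "B", "B", "A"),
  stringsAsFactors = FALSE
)

# One character vector of company types per user.
types <- split(df$company.type, df$userid)

# TRUE for users whose set of types is exactly {A, B} or exactly {A, C}.
ok <- vapply(
  types,
  function(x) setequal(x, c("A", "B")) || setequal(x, c("A", "C")),
  logical(1)
)

n_users <- sum(ok)
```

Because the match must be exact, a user with only A (user 4), with B and C (user 3), or with all three types would not be counted.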
Sort DF and reduce it to one row per userid with a types column consisting of a comma-separated string of company types. Then filter it using the indicated condition. Finally use tally to get the number of rows left after filtering. To get the details omit the tally line.
library(dplyr)
DF %>%
arrange(userid, company.type) %>%
group_by(userid) %>%
summarize(types = toString(company.type)) %>%
ungroup %>%
filter(grepl("A.*B|A.*C", types) & ! grepl("B.*C", types)) %>%
tally
giving:
# A tibble: 1 x 1
n
<int>
1 2
Note
The input used, in reproducible form, is:
Lines <- "userid company.type
1 A
2 A
3 C
1 B
2 B
3 B
4 A"
DF <- read.table(text = Lines, header = TRUE)