This has got to be simple, but I'm stuck. I want to mutate some grouped data using a where statement within an ifelse statement. Here's an example that works:
example <- tibble::tribble(
~Group, ~Code, ~Value,
"1", "A", 1,
"1", "B", 1,
"1", "C", 5,
"2", "A", 1,
"2", "B", 5,
"2", "C", 2
)
example %>% group_by(Group) %>%
mutate(GroupStatus=ifelse(Value[Code=="C"]==5, 1, 0))
This gives the desired result:
Group Code Value GroupStatus
<chr> <chr> <dbl> <dbl>
1 1 A 1 1
2 1 B 1 1
3 1 C 5 1
4 2 A 1 0
5 2 B 5 0
6 2 C 2 0
The problem is when one of the groups is missing Code C, as below:
example2 <- tibble::tribble(
~Group, ~Code, ~Value,
"1", "A", 1,
"1", "B", 1,
"1", "C", 5,
"2", "A", 1,
"2", "B", 5
)
example2 %>% group_by(Group) %>%
mutate(GroupStatus=ifelse(Value[Code=="C"]==5, 1, 0))
This gives me an error: Error: Problem with mutate() column GroupStatus.
i GroupStatus = ifelse(Value[Code == "C"] == 5, 1, 0).
i GroupStatus must be size 2 or 1, not 0.
i The error occurred in group 2: Group = "2".
What I'd like is for "GroupStatus" in any group that is missing Code C to just be set to zero. Is that possible?
Another possible solution, based on a nested ifelse:
library(dplyr)
example2 <- tibble::tribble(
~Group, ~Code, ~Value,
"1", "A", 1,
"1", "B", 1,
"1", "C", 5,
"2", "A", 1,
"2", "B", 5
)
example2 %>%
group_by(Group) %>%
mutate(GroupStatus = ifelse("C" %in% Code,
ifelse(Value[Code == "C"] == 5, 1, 0), 0)) %>%
ungroup()
#> # A tibble: 5 × 4
#> Group Code Value GroupStatus
#> <chr> <chr> <dbl> <dbl>
#> 1 1 A 1 1
#> 2 1 B 1 1
#> 3 1 C 5 1
#> 4 2 A 1 0
#> 5 2 B 5 0
You really only have a single condition to check per group, so we can simplify to an any() instead of ifelse():
example2 %>%
group_by(Group) %>%
mutate(GroupStatus = as.integer(any(Value == 5 & Code == "C")))
# # A tibble: 5 × 4
# # Groups: Group [2]
# Group Code Value GroupStatus
# <chr> <chr> <dbl> <int>
# 1 1 A 1 1
# 2 1 B 1 1
# 3 1 C 5 1
# 4 2 A 1 0
# 5 2 B 5 0
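For what it's worth, the original error comes from an empty subset: in a group with no Code C, Value[Code == "C"] has length 0, so ifelse() returns a length-0 result that mutate() cannot recycle to the group size. A minimal base R sketch (the two vectors are made up to mimic group 2) of why any() sidesteps this:

```r
# Hypothetical vectors mimicking group 2, which has no Code "C"
Value <- c(1, 5)
Code  <- c("A", "B")

# Subsetting with an all-FALSE condition yields a zero-length vector,
# so the comparison and the ifelse() built on it are zero-length too
empty_test   <- Value[Code == "C"] == 5
ifelse_piece <- ifelse(empty_test, 1, 0)
length(empty_test)    # 0
length(ifelse_piece)  # 0

# any() collapses the row-wise test to exactly one value per group,
# and any() of "no TRUE values" is FALSE
status <- as.integer(any(Value == 5 & Code == "C"))
status                # 0
```

Because any() always returns a single value, the result can be recycled across the group whether or not Code C is present.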
The data that I have:
x = tibble(
study = c("A", "B", "C", "A", "B", "A", "B", "C", "A", "B"),
ID = c(001, 001, 001, 005, 005, 007, 007, 007, 012, 012)
)
The goal is to create the 'number' variable which shows the same number for each unique ID in sequence starting from 1.
goal = tibble(
study = c("A", "B", "C", "A", "B", "A", "B", "C", "A", "B"),
ID = c(001, 001, 001, 005, 005, 007, 007, 007, 012, 012),
number = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4)
)
Then, within each ID group, if the set of studies is incomplete (e.g., for number = 2 the studies are only A and B instead of A, B, C), how do I remove the observations associated with that ID (e.g., remove the obs that have a number of 2)?
Thanks!
Updated follow-up question on part B:
Once we have the goal dataset, I would like to remove the obs grouped by ID, that meet the following requirements in terms of the study variable:
A and D are required, one of B and C is required (so either B or C), and sometimes each letter will appear more than once.
x = tibble(
study = c("A", "B", "C", "D", "A", "B", "A", "B", "C", "A", "B", "C", "D", "D", "A", "B", "D", "B", "C", "D"),
ID = c(001, 001, 001, 001, 005, 005, 007, 007, 007, 012, 012, 012, 012, 012, 013, 013, 013, 018, 018, 018),
number = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 6, 6, 6)
)
So in the goal dataset above, I would like to remove:
(1) Obs #5 and 6 which share a group number of 2, because they don't have A, B or C, and D in the study variable.
(2) Obs #18, 19, 20 which share a group number of 6, for the same reason as (1).
I would like to keep the rest of the obs because within each number group, they have A, B or C, and D. I cannot use filter(n() > 3) here, because that would delete obs with the number 5.
We could use cur_group_id()
library(dplyr)
x %>%
group_by(ID) %>%
mutate(number = cur_group_id())
study ID number
<chr> <dbl> <int>
1 A 1 1
2 B 1 1
3 C 1 1
4 A 5 2
5 B 5 2
6 A 7 3
7 B 7 3
8 C 7 3
9 A 12 4
10 B 12 4
OR
library(dplyr)
x %>%
mutate(number = cumsum(ID != lag(ID, default = first(ID)))+1)
study ID number
<chr> <dbl> <dbl>
1 A 1 1
2 B 1 1
3 C 1 1
4 A 5 2
5 B 5 2
6 A 7 3
7 B 7 3
8 C 7 3
9 A 12 4
10 B 12 4
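Note that the two approaches are not interchangeable in general: cur_group_id() numbers the groups in sorted order of ID, while the cumsum() version starts a new number at every change in ID, so it assumes equal IDs sit in consecutive rows. A base R sketch of the difference, where ID2 is a made-up unsorted vector used only for illustration:

```r
ID <- c(1, 1, 1, 5, 5, 7, 7, 7, 12, 12)

# Numbering by order of first appearance
by_appearance <- match(ID, unique(ID))

# Numbering by sorted order of the distinct IDs
by_sorted <- as.integer(factor(ID))

# On the sorted example data both give 1 1 1 2 2 3 3 3 4 4
by_appearance
by_sorted

# Hypothetical unsorted input where the strategies diverge
ID2 <- c(7, 7, 1, 1, 7)
appearance2 <- match(ID2, unique(ID2))                         # 1 1 2 2 1
sorted2     <- as.integer(factor(ID2))                         # 2 2 1 1 2
runs2       <- cumsum(c(FALSE, ID2[-1] != ID2[-length(ID2)])) + 1  # 1 1 2 2 3
```

For data already sorted by ID, as in the question, all three agree.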
A) The dplyr package offers group_indices() for adding unique group identifiers:
library(dplyr)
df$number <- df %>%
group_indices(ID)
df
# A tibble: 10 × 3
study ID number
<chr> <dbl> <int>
1 A 1 1
2 B 1 1
3 C 1 1
4 A 5 2
5 B 5 2
...
B) You can drop observations where the group size is less than 3 (i.e., "A", "B" and "C") with filter():
df %>%
group_by(ID) %>%
filter(n() == 3)
# A tibble: 6 × 3
# Groups: ID [2]
study ID number
<chr> <dbl> <int>
1 A 1 1
2 B 1 1
3 C 1 1
4 A 7 3
5 B 7 3
6 C 7 3
A and D are required, one of B and C is required (so either B or C)
df %>%
group_by(ID) %>%
mutate(
flag =
(
any(study %in% c("A")) &
any(study %in% c("D"))
) &
(
any(study %in% c("B")) |
any(study %in% c("C"))
)
) %>%
filter(flag)
# A tibble: 12 × 4
# Groups: ID [3]
study ID number flag
<chr> <dbl> <dbl> <lgl>
1 A 1 1 TRUE
2 B 1 1 TRUE
3 C 1 1 TRUE
4 D 1 1 TRUE
5 A 12 4 TRUE
6 B 12 4 TRUE
7 C 12 4 TRUE
8 D 12 4 TRUE
9 D 12 4 TRUE
10 A 13 5 TRUE
11 B 13 5 TRUE
12 D 13 5 TRUE
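If the helper flag column isn't needed, the same rule can probably be written directly inside filter(). This is only a sketch under the assumption that the data is the x tibble from the follow-up question (rebuilt here without the number column); kept is a name made up for the example:

```r
library(dplyr)

# Data from the follow-up question (number column omitted)
x <- tibble(
  study = c("A", "B", "C", "D", "A", "B", "A", "B", "C", "A",
            "B", "C", "D", "D", "A", "B", "D", "B", "C", "D"),
  ID = c(1, 1, 1, 1, 5, 5, 7, 7, 7, 12,
         12, 12, 12, 12, 13, 13, 13, 18, 18, 18)
)

kept <- x %>%
  group_by(ID) %>%
  filter(
    all(c("A", "D") %in% study),  # both A and D must be present
    any(c("B", "C") %in% study)   # at least one of B or C must be present
  ) %>%
  ungroup()

kept
```

Each condition evaluates to a single TRUE or FALSE per group, so a group is kept or dropped as a whole, and duplicated letters within a group do not matter.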
I have a dataframe as below, with all the values corresponding to an 'other' type, belonging to specific IDs:
df <- data.frame(ID = c("1", "1", "1", "2", "2", "3"), type = c("oth", "oth", "oth", "oth", "oth", "oth"), value = c("A", "B", "B", "C", "D", "D"))
ID type value
1 oth A
1 oth B
1 oth B
2 oth C
2 oth D
3 oth D
I would like to change the types of the rows with values A, B, C to be 1, 2, 3 respectively (D stays as 'oth'). If it is changed, I would like to keep the 'oth' row but have the value as NA.
The above df would result into:
df2 <- data.frame(ID = c("1", "1", "1", "1", "1", "1", "2", "2", "2", "3"), type = c("1", "oth", "2", "oth", "2", "oth", "3", "oth", "oth", "oth"), value = c("A", NA, "B", NA, "B", NA, "C", NA, "D", "D"))
ID type value
1 1 A
1 oth <NA>
1 2 B
1 oth <NA>
1 2 B
1 oth <NA>
2 3 C
2 oth <NA>
2 oth D
3 oth D
Note that any rows that match A,B,C will create a new row with correct type, but change the original one to value = NA. If possible, a dplyr solution would be preferred.
Any help would be appreciated, thanks!
You can create a vector (values) of the values to change. Filter the rows with those values and set the value column to NA. In the original data, use match to change the type for 'A' to 1, 'B' to 2 and 'C' to 3. Then bind the two data frames together.
library(dplyr)
values <- c('A', 'B', 'C')
df %>%
filter(value %in% values) %>%
mutate(value = NA) %>%
bind_rows(df %>%
mutate(type = match(value, values),
type = replace(type, is.na(type), 'oth'))) %>%
arrange(ID, type)
# ID type value
#1 1 1 A
#2 1 2 B
#3 1 2 B
#4 1 oth <NA>
#5 1 oth <NA>
#6 1 oth <NA>
#7 2 3 C
#8 2 oth <NA>
#9 2 oth D
#10 3 oth D
You may try this way
df
rbind(df,
df%>%
filter(value %in% c("A", "B", "C")) %>%
mutate(type = case_when(value == "A" ~ 1,
value == "B" ~ 2,
value == "C" ~ 3),
value = NA)) %>%
arrange(ID)
ID type value
1 1 oth A
2 1 oth B
3 1 oth B
4 1 1 <NA>
5 1 2 <NA>
6 1 2 <NA>
7 2 oth C
8 2 oth D
9 2 3 <NA>
10 3 oth D
That would be my approach, I didn't use dplyr but the order seemed important
my_df <- data.frame(ID = c("1", "1", "1", "2", "2", "3"), type = c("oth", "oth", "oth", "oth", "oth", "oth"), value = c("A", "B", "B", "C", "D", "D"))
my_var <- which(my_df$value %in% c("A", "B", "C"))
if (length(my_var)) {
my_temp <- my_df[my_var,]
}
my_var <- which(my_temp$value == "A")
if (length(my_var)) {
my_temp[my_var, "type"] <- 1
}
my_var <- which(my_temp$value == "B")
if (length(my_var)) {
my_temp[my_var, "type"] <- 2
}
my_var <- which(my_temp$value == "C")
if (length(my_var)) {
my_temp[my_var, "type"] <- 3
}
my_df <- rbind(my_temp, my_df)
my_df <- my_df[order(my_df$ID, my_df$value),]
my_var <- which(my_df$type == "oth" & my_df$value %in% c("A", "B", "C"))
if (length(my_var)) {
my_df[my_var, "value"] <- NA
}
Here is potentially another dplyr option.
First, create a vector vec with the specific categories to match and obtain numeric values for.
Then, you can create groups based on whether each row's value is contained within the vector vec. This allows you to insert rows, combining rows with rbind.
Within each group, the first row has its type converted to a number; the rows that follow either get NA for value (if value is in vec) or keep the same value.
This seems to work with your example data. Let me know if this meets your needs.
library(dplyr)
vec <- c("A", "B", "C")
df %>%
group_by(grp = cumsum(value %in% vec)) %>%
do(rbind(
mutate(head(., 1), type = match(value, vec)),
mutate(., value = ifelse(value %in% vec, NA, value)))) %>%
ungroup() %>%
select(-grp)
Output
ID type value
<chr> <chr> <chr>
1 1 1 A
2 1 oth NA
3 1 2 B
4 1 oth NA
5 1 2 B
6 1 oth NA
7 2 3 C
8 2 oth NA
9 2 oth D
10 3 oth D
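If the interleaved row order of the desired output matters, one more way to get it is to give every row a sort key. The following is a base R sketch rather than a dplyr solution; the names hit, new_rows, orig and out are made up for the example:

```r
df <- data.frame(
  ID    = c("1", "1", "1", "2", "2", "3"),
  type  = "oth",
  value = c("A", "B", "B", "C", "D", "D"),
  stringsAsFactors = FALSE
)
vec <- c("A", "B", "C")
hit <- df$value %in% vec

# New rows: keep the value, replace the type with its position in vec
new_rows <- df[hit, ]
new_rows$type <- as.character(match(new_rows$value, vec))

# Original rows: keep type "oth", blank out the matched values
orig <- df
orig$value[hit] <- NA

# Sort key: each new row sorts just before the original row it came from
key <- c(which(hit) - 0.5, seq_len(nrow(df)))
out <- rbind(new_rows, orig)[order(key), ]
rownames(out) <- NULL
out
```

The half-step key is just one way to keep each inserted row next to its source row; any scheme that preserves the original row index would do.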
Here is my data
id<- c("1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3", "3")
behav1<- c("A", "C", "C", "B", "C", "C", "A", "A", "A", "B", "B", "A")
behav2<- c("C", "A", "A", "B", "A", "B", "A", "B", "C", "B", "B", "C")
df <- data.frame(id, behav1, behav2)
I want to make a transition table to see how many times people go from behav1 to behav2.
for example,
subj no. 1 did A to C (one time), C to A(two times), B to B(one time)
I'd like to make a transition table by counts for each subject
Here is my code
df %>%
group_by(id) %>%
summarise(c = as.data.frame.matrix(table(df$behav1, df$behav2)))
However, what I got is total counts by all subjects
What did I do wrong?
Thanks for the help in advance!
Will this work:
library(dplyr)
library(purrr)
map(df %>% group_by(id) %>% group_split(.keep = FALSE), table)
[[1]]
behav2
behav1 A B C
A 0 0 1
B 0 1 0
C 2 0 0
[[2]]
behav2
behav1 A B
A 1 1
C 1 1
[[3]]
behav2
behav1 B C
A 0 2
B 2 0
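As for what went wrong in the original attempt: df$behav1 refers to the full column of the original data frame, so it bypasses the grouping and every group sees all rows. A base R sketch of the per-subject tables, with the advantage that the list is named by id (per_id is a name made up for the example):

```r
id     <- c("1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3", "3")
behav1 <- c("A", "C", "C", "B", "C", "C", "A", "A", "A", "B", "B", "A")
behav2 <- c("C", "A", "A", "B", "A", "B", "A", "B", "C", "B", "B", "C")
df <- data.frame(id, behav1, behav2, stringsAsFactors = FALSE)

# Split the two behaviour columns by subject, then cross-tabulate each piece
per_id <- lapply(split(df[c("behav1", "behav2")], df$id), table)

per_id[["1"]]            # transition counts for subject 1
per_id[["1"]]["C", "A"]  # subject 1 went from C to A twice
```

Each table only contains the levels seen for that subject, as in the output above.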
It's probably easier to not use table() here. Instead, group_by() all 3 columns (giving you groups for each unique person and transition), then summarize:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
id <- c("1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3", "3")
behav1 <- c("A", "C", "C", "B", "C", "C", "A", "A", "A", "B", "B", "A")
behav2 <- c("C", "A", "A", "B", "A", "B", "A", "B", "C", "B", "B", "C")
df <- data.frame(id, behav1, behav2)
(long_result <- df |>
group_by(id, behav1, behav2) |>
summarize(n = n(), .groups = "drop")
)
#> # A tibble: 9 x 4
#> id behav1 behav2 n
#> <chr> <chr> <chr> <int>
#> 1 1 A C 1
#> 2 1 B B 1
#> 3 1 C A 2
#> 4 2 A A 1
#> 5 2 A B 1
#> 6 2 C A 1
#> 7 2 C B 1
#> 8 3 A C 2
#> 9 3 B B 2
It might be easier to present this in wide format:
(wide_result <- long_result |>
tidyr::pivot_wider(
id_cols = "id",
names_from = c("behav1", "behav2"),
names_sep = " -> ",
values_from = "n"
) |>
mutate(across(everything(), tidyr::replace_na, replace = 0))
)
#> # A tibble: 3 x 7
#> id `A -> C` `B -> B` `C -> A` `A -> A` `A -> B` `C -> B`
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 1 2 0 0 0
#> 2 2 0 0 1 1 1 1
#> 3 3 2 2 0 0 0 0
Created on 2021-05-30 by the reprex package (v2.0.0)
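As a possible variation, the zeros can be filled in during the pivot itself, which avoids running replace_na over the id column. This is a sketch that assumes a tidyr version where values_fill accepts a single value; count() is used as shorthand for the group_by() plus summarize(n = n()) step, and wide is a name made up for the example:

```r
library(dplyr)
library(tidyr)

id     <- c("1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3", "3")
behav1 <- c("A", "C", "C", "B", "C", "C", "A", "A", "A", "B", "B", "A")
behav2 <- c("C", "A", "A", "B", "A", "B", "A", "B", "C", "B", "B", "C")
df <- data.frame(id, behav1, behav2)

wide <- df %>%
  count(id, behav1, behav2) %>%   # one row per subject and transition
  pivot_wider(
    id_cols     = id,
    names_from  = c(behav1, behav2),
    names_sep   = " -> ",
    values_from = n,
    values_fill = 0               # transitions never observed become 0
  )

wide
```

The column type of the counts may differ from the output shown above depending on package versions.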
I have a data frame that contains around 700 cases with 1800 examinations. Some cases underwent several different modalities. I want to leave only one examination result based on the specific condition of the modality.
Here is a dummy data frame:
df <- data.frame (ID = c("1", "1", "1", "2", "2", "3", "4", "4", "5", "5"),
c1 = c("A", "B", "C", "A", "C", "A", "A", "B", "B", "C"),
x1 = c(5, 4, 5, 3, 1, 3, 4, 2, 3, 5),
x2 = c(4, 3, 7, 9, 1, 2, 4, 7, 5, 0))
There are five cases with 10 exams. [c1] is the exam modality (condition), and the results are x1 and x2.
I want to leave only one row based on the following condition:
C > B > A
I want to keep the row with C first; if there is no C, keep the row with B; if both C and B are absent, keep the row with A.
Desired output:
output <- data.frame (ID = c("1", "2", "3", "4", "5"),
c1 = c("C", "C", "A", "B", "C"),
x1 = c(5, 1, 3, 2, 5),
x2 = c(7, 1, 2, 7, 0))
You can arrange the data based on the required order and, for each ID, select its first row.
library(dplyr)
req_order <- c('C', 'B', 'A')
df %>%
arrange(ID, match(c1, req_order)) %>%
distinct(ID, .keep_all = TRUE)
# ID c1 x1 x2
# <chr> <chr> <dbl> <dbl>
#1 1 C 5 7
#2 2 C 1 1
#3 3 A 3 2
#4 4 B 2 7
#5 5 C 5 0
In base R, this can be written as :
df1 <- df[order(match(df$c1, req_order)), ]
df1[!duplicated(df1$ID), ]
Here is one approach:
df.srt <- df[order(df$c1, decreasing=TRUE), ]
df.spl <- split(df.srt, df.srt$ID)
first <- lapply(df.spl, head, n=1)
result <- do.call(rbind, first)
result
# ID c1 x1 x2
# 1 1 C 5 7
# 2 2 C 1 1
# 3 3 A 3 2
# 4 4 B 2 7
# 5 5 C 5 0
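One caveat on the last approach: sorting c1 in decreasing order works here only because the priority C > B > A happens to match reverse alphabetical order. For an arbitrary priority, ranking with match() is safer. A sketch of a grouped version, assuming dplyr 1.0.0 or later for slice_min(); req_order and result are names chosen for the example:

```r
library(dplyr)

df <- data.frame(
  ID = c("1", "1", "1", "2", "2", "3", "4", "4", "5", "5"),
  c1 = c("A", "B", "C", "A", "C", "A", "A", "B", "B", "C"),
  x1 = c(5, 4, 5, 3, 1, 3, 4, 2, 3, 5),
  x2 = c(4, 3, 7, 9, 1, 2, 4, 7, 5, 0)
)

# Priority list: earlier entries win
req_order <- c("C", "B", "A")

result <- df %>%
  group_by(ID) %>%
  # match() turns each modality into its rank; keep the best-ranked row
  slice_min(match(c1, req_order), n = 1, with_ties = FALSE) %>%
  ungroup()

result
```

with_ties = FALSE guarantees a single row per ID even if a modality is repeated within a case.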
I want to conditionally summarize several variables by group. The following code does that, but I'm not sure how to do this without specifying each variable and the conditions in the summarize step.
library(tidyverse)
dat <- data.frame(group = c("A", "A", "A", "B", "B", "B"),
indicator = c(1, 2, 3, 1, 2, 3),
var1 = c(1, 0, 1, 2, 1, 2),
var2 = c(1, 0, 1, 1, 2, 1))
# dat
# group indicator var1 var2
#1 A 1 1 1
#2 A 2 0 0
#3 A 3 1 1
#4 B 1 2 1
#5 B 2 1 2
#6 B 3 2 1
dat %>%
group_by(group) %>%
summarise(var1 = sum(var1[indicator==1 | indicator==2]),
var2 = sum(var2[indicator==1 | indicator==2]))
# A tibble: 2 x 3
# group var1 var2
#* <chr> <dbl> <dbl>
#1 A 1 1
#2 B 3 3
Use across():
library(dplyr)
dat %>%
group_by(group) %>%
summarise(across(starts_with('var'), ~sum(.[indicator %in% 1:2])))
# group var1 var2
#* <chr> <dbl> <dbl>
#1 A 1 1
#2 B 3 3
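Since the same condition applies to every variable, another option is to filter the rows first and then sum. A sketch, with one caveat stated as an assumption about your needs: a group with no rows where indicator is 1 or 2 would drop out of the result entirely, whereas the subsetting version above would return 0 for it.

```r
library(dplyr)

dat <- data.frame(
  group     = c("A", "A", "A", "B", "B", "B"),
  indicator = c(1, 2, 3, 1, 2, 3),
  var1      = c(1, 0, 1, 2, 1, 2),
  var2      = c(1, 0, 1, 1, 2, 1)
)

res <- dat %>%
  filter(indicator %in% 1:2) %>%               # apply the condition once
  group_by(group) %>%
  summarise(across(starts_with("var"), sum))   # plain sum on what is left

res
```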