The data that I have:
x = tibble(
study = c("A", "B", "C", "A", "B", "A", "B", "C", "A", "B"),
ID = c(001, 001, 001, 005, 005, 007, 007, 007, 012, 012)
)
The goal is to create the 'number' variable which shows the same number for each unique ID in sequence starting from 1.
goal = tibble(
study = c("A", "B", "C", "A", "B", "A", "B", "C", "A", "B"),
ID = c(001, 001, 001, 005, 005, 007, 007, 007, 012, 012),
number = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4)
)
Then, if the studies within an ID group are incomplete (e.g., number = 2 has only studies A and B instead of A, B, and C), how can I remove all obs associated with that ID (e.g., remove the obs that have a number of 2)?
Thanks!
Updated follow-up question on part B:
Once we have the goal dataset, I would like to keep only the ID groups that meet the following requirements on the study variable, and remove the obs of all other groups:
A and D are required, and at least one of B and C is required (so either B or C). Each letter may appear more than once within a group.
x = tibble(
study = c("A", "B", "C", "D", "A", "B", "A", "B", "C", "A", "B", "C", "D", "D", "A", "B", "D", "B", "C", "D"),
ID = c(001, 001, 001, 001, 005, 005, 007, 007, 007, 012, 012, 012, 012, 012, 013, 013, 013, 018, 018, 018),
number = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 6, 6, 6)
)
So in the goal dataset above, I would like to remove:
(1) Obs #5 and 6 which share a group number of 2, because they don't have A, B or C, and D in the study variable.
(2) Obs #18, 19, 20 which share a group number of 6, for the same reason as (1).
I would like to keep the rest of the obs because within each number group, they have A, B or C, and D. I cannot use filter(n() > 3) here, because that would delete obs with the number 5.
We could use cur_group_id()
library(dplyr)
x %>%
group_by(ID) %>%
mutate(number = cur_group_id())
study ID number
<chr> <dbl> <int>
1 A 1 1
2 B 1 1
3 C 1 1
4 A 5 2
5 B 5 2
6 A 7 3
7 B 7 3
8 C 7 3
9 A 12 4
10 B 12 4
OR
library(dplyr)
x %>%
mutate(number = cumsum(ID != lag(ID, default = first(ID)))+1)
study ID number
<chr> <dbl> <dbl>
1 A 1 1
2 B 1 1
3 C 1 1
4 A 5 2
5 B 5 2
6 A 7 3
7 B 7 3
8 C 7 3
9 A 12 4
10 B 12 4
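Note that the two approaches above agree here only because the rows are already sorted by ID: cur_group_id() numbers the groups in sorted key order, while the cumsum()/lag() version starts a new number every time ID changes from one row to the next. A minimal base R sketch, assuming you want the groups numbered in order of first appearance whether or not the data is sorted, is match() against unique(). The data is copied from the question as a plain data.frame so no packages are needed.

```r
# Data from the question (plain data.frame, no packages required)
x <- data.frame(
  study = c("A", "B", "C", "A", "B", "A", "B", "C", "A", "B"),
  ID    = c(1, 1, 1, 5, 5, 7, 7, 7, 12, 12)
)

# match() returns, for each ID, its position among the unique IDs,
# i.e. a group number in order of first appearance
x$number <- match(x$ID, unique(x$ID))
x
```

If the IDs are not contiguous (the same ID shows up again further down), this still gives the repeated ID its original number, which the cumsum()/lag() version would not.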
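A) The dplyr package offers group_indices() for adding unique group identifiers (df here is the data frame called x in the question; since dplyr 1.0.0, passing grouping columns to group_indices() is deprecated in favor of cur_group_id()):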
library(dplyr)
df$number <- df %>%
group_indices(ID)
df
# A tibble: 10 × 3
study ID number
<chr> <dbl> <int>
1 A 1 1
2 B 1 1
3 C 1 1
4 A 5 2
5 B 5 2
...
B) You can drop observations where the group size is less than 3 (i.e., "A", "B" and "C") with filter():
df %>%
group_by(ID) %>%
filter(n() == 3)
# A tibble: 6 × 3
# Groups: ID [2]
study ID number
<chr> <dbl> <int>
1 A 1 1
2 B 1 1
3 C 1 1
4 A 7 3
5 B 7 3
6 C 7 3
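Counting rows only works if every study appears at most once per ID. A hedged sketch that checks for the actual letters instead is below; it assumes df is the goal tibble from the question, and required is just a helper name I made up for the set of studies that must be present.

```r
library(dplyr)

# Goal data from the question
df <- tibble(
  study  = c("A", "B", "C", "A", "B", "A", "B", "C", "A", "B"),
  ID     = c(1, 1, 1, 5, 5, 7, 7, 7, 12, 12),
  number = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4)
)

required <- c("A", "B", "C")  # hypothetical helper: studies every ID must have

complete <- df %>%
  group_by(ID) %>%
  # keep a group only if every required study occurs in it
  filter(all(required %in% study)) %>%
  ungroup()
complete
```

This keeps IDs 1 and 7, and it would still do so if one of the studies were listed twice for an ID.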
A and D are required, one of B and C is required (so either B or C)
df %>%
group_by(ID) %>%
mutate(
flag =
(
any(study %in% c("A")) &
any(study %in% c("D"))
) &
(
any(study %in% c("B")) |
any(study %in% c("C"))
)
) %>%
filter(flag)
# A tibble: 12 × 4
# Groups: ID [3]
study ID number flag
<chr> <dbl> <dbl> <lgl>
1 A 1 1 TRUE
2 B 1 1 TRUE
3 C 1 1 TRUE
4 D 1 1 TRUE
5 A 12 4 TRUE
6 B 12 4 TRUE
7 C 12 4 TRUE
8 D 12 4 TRUE
9 D 12 4 TRUE
10 A 13 5 TRUE
11 B 13 5 TRUE
12 D 13 5 TRUE
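If you don't need the flag column afterwards, the same rule can probably go straight into filter(). This is only a sketch of the logic above, using the data from the follow-up question; kept is a name I picked for the result. Note that ID 7 is dropped as well, since it has no D.

```r
library(dplyr)

# Data from the follow-up question
df <- tibble(
  study = c("A", "B", "C", "D", "A", "B", "A", "B", "C", "A",
            "B", "C", "D", "D", "A", "B", "D", "B", "C", "D"),
  ID = c(1, 1, 1, 1, 5, 5, 7, 7, 7, 12,
         12, 12, 12, 12, 13, 13, 13, 18, 18, 18),
  number = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 4,
             4, 4, 4, 4, 5, 5, 5, 6, 6, 6)
)

kept <- df %>%
  group_by(ID) %>%
  filter(
    all(c("A", "D") %in% study),  # both A and D must be present
    any(c("B", "C") %in% study)   # at least one of B or C
  ) %>%
  ungroup()
kept
```

Conditions separated by commas inside filter() are combined with AND, so this matches the flag expression above.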
I want to create an assign column based on rank and limit, by group.
In particular, each group has a priority rank (e.g., 1,2,3 or 1,3,6 or 3,4,5 etc.). Based on the rank (a smaller number means higher priority), I want to allocate the resource given in the limit column. Right now I am doing this by hand, but I would like to express it using tidyverse. How do I allocate with mutate and group_by (or other methods)?
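Using tidyverse, you can use top_n after grouping (top_n has since been superseded by slice_min/slice_max, but it still works). This keeps the rows with the smallest rank in each group, where the number of rows to keep is determined by limit. The rows kept are assigned 1 and then joined back to your original data, and the remaining rows get 0.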
Let me know if this provides the desired result.
library(tidyverse)
df %>%
group_by(group) %>%
top_n(limit[1], desc(rank)) %>%
mutate(assign = 1) %>%
right_join(df) %>%
replace_na(list(assign = 0)) %>%
arrange(group, rank)
Output
group rank limit assign
<chr> <dbl> <dbl> <dbl>
1 A 1 1 1
2 A 2 1 0
3 A 3 1 0
4 B 1 1 1
5 B 3 1 0
6 B 6 1 0
7 C 3 2 1
8 C 4 2 1
9 C 5 2 0
10 C 6 2 0
Data
df <- structure(list(group = c("A", "A", "A", "B", "B", "B", "C", "C",
"C", "C"), rank = c(1, 2, 3, 1, 3, 6, 3, 4, 5, 6), limit = c(1,
1, 1, 1, 1, 1, 2, 2, 2, 2)), class = "data.frame", row.names = c(NA,
-10L))
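An alternative that avoids the join, as a rough sketch only: rank the rows within each group and compare the position to limit. It assumes limit is constant within a group, as in the data above; out is just a name for the result.

```r
library(dplyr)

# Same data as in the answer above
df <- data.frame(
  group = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "C"),
  rank  = c(1, 2, 3, 1, 3, 6, 3, 4, 5, 6),
  limit = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2)
)

out <- df %>%
  group_by(group) %>%
  # min_rank() gives each row's position within its group (1 = smallest rank);
  # rows whose position is within the limit get 1, all others 0
  mutate(assign = as.integer(min_rank(rank) <= limit)) %>%
  ungroup()
out
```

With tied ranks, min_rank() can assign more rows than limit allows; swapping in row_number(rank) would break ties by row order instead.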
I am struggling to write the right logic to filter two columns based only on the condition in one column. I have multiple ids and if an id appears in 2020, I want all the data for the other years that id was measured to come along.
As an example, if a group contains the number 3, I want all the values in that group. We should end up with a dataframe with all the b and d rows.
df4 <- data.frame(group = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b",
"c", "c", "c", "c", "c", "d", "d", "d", "d", "d"),
pop = c(1, 2, 2, 4, 5, 1, 2, 3, 4, 5, 1, 2, 1, 4, 5, 1, 2, 3, 4, 5),
value = c(1,2,3,2.5,2,2,3,4,3.5,3,3,2,1,2,2.5,0.5,1.5,6,2,1.5))
threes <- df4 %>%
filter(pop == 3 |&ifelse????
A bit slower than the other answers here (more steps involved), but for me a bit clearer:
df4 %>%
filter(pop == 3) %>%
distinct(group) %>%
pull(group) -> groups
df4 %>%
filter(group %in% groups)
or if you want to combine the two steps:
df4 %>%
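# the inner pipeline must be wrapped in parentheses, because %in% and %>%
# have the same precedence and would otherwise be evaluated left to right
filter(group %in% (df4 %>%
filter(pop == 3) %>%
distinct(group) %>%
pull(group)))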
You can do:
df4[df4$group %in% df4$group[df4$pop == 3],]
#> group pop value
#> 6 b 1 2.0
#> 7 b 2 3.0
#> 8 b 3 4.0
#> 9 b 4 3.5
#> 10 b 5 3.0
#> 16 d 1 0.5
#> 17 d 2 1.5
#> 18 d 3 6.0
#> 19 d 4 2.0
#> 20 d 5 1.5
You can do this by combining dplyr's group_by() and filter() with any(). Inside a grouped filter(), any(pop == 3) is evaluated once per group and is TRUE when at least one row of that group has pop equal to 3, so the whole group is kept.
Follow these steps:
First, pipe the data to group_by() to group by your group variable.
Then pipe to filter() and use any() to keep the groups in which any pop is equal to 3.
df4 <- data.frame(group = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b",
"c", "c", "c", "c", "c", "d", "d", "d", "d", "d"),
pop = c(1, 2, 2, 4, 5, 1, 2, 3, 4, 5, 1, 2, 1, 4, 5, 1, 2, 3, 4, 5),
value = c(1,2,3,2.5,2,2,3,4,3.5,3,3,2,1,2,2.5,0.5,1.5,6,2,1.5))
# load the library
library(dplyr)
threes <- df4 %>%
group_by(group) %>%
filter(any(pop == 3))
# print the result
threes
Output:
threes
# A tibble: 10 x 3
# Groups: group [2]
group pop value
<chr> <dbl> <dbl>
1 b 1 2
2 b 2 3
3 b 3 4
4 b 4 3.5
5 b 5 3
6 d 1 0.5
7 d 2 1.5
8 d 3 6
9 d 4 2
10 d 5 1.5
An easy base R option is using subset + ave
subset(
df4,
ave(pop == 3, group, FUN = any)
)
which gives
group pop value
6 b 1 2.0
7 b 2 3.0
8 b 3 4.0
9 b 4 3.5
10 b 5 3.0
16 d 1 0.5
17 d 2 1.5
18 d 3 6.0
19 d 4 2.0
20 d 5 1.5
Use dplyr:
df4 %>% group_by(group) %>% filter(any(pop == 3))
I am trying to compute a balance column.
So, to show an example, I want to go from this:
df <- data.frame(group = c("A", "A", "A", "A", "A"),
start = c(5, 0, 0, 0, 0),
receipt = c(1, 5, 6, 4, 6),
out = c(4, 5, 3, 2, 5))
> df
group start receipt out
1 A 5 1 4
2 A 0 5 5
3 A 0 6 3
4 A 0 4 2
5 A 0 6 5
to creating a new balance column like the following
> dfb
group start receipt out balance
1 A 5 1 4 2
2 A 0 5 5 2
3 A 0 6 3 5
4 A 0 4 2 7
5 A 0 6 5 8
I tried the following attempt but it isn't working
dfc <- df %>%
group_by(group) %>%
mutate(balance = if_else(row_number() == 1, start + receipt - out, (lag(balance) + receipt) - out)) %>%
ungroup()
Would really appreciate some help with this. Thanks!
You could use cumsum (a base R function) inside dplyr's mutate. Note: I had to change your initial df table to match the one in your required result, because you had different data in "out".
df <- data.frame(group = c("A", "A", "A", "A", "A"),
start = c(5, 0, 0, 0, 0),
receipt = c(1, 5, 6, 4, 6),
out = c(4, 5, 3, 2, 5))
dfc <- df %>%
group_by(group) %>%
mutate(balance=cumsum(start+receipt-out))
Source: local data frame [5 x 5]
Groups: group [1]
group start receipt out balance
<fctr> <dbl> <dbl> <dbl> <dbl>
1 A 5 1 4 2
2 A 0 5 5 2
3 A 0 6 3 5
4 A 0 4 2 7
5 A 0 6 5 8
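Why this works: the balance you describe is "previous balance + start + receipt - out", starting from 0, and that is exactly a running total of the per-row net change. A small base R sketch to check this on the data above; net and balance_loop are names I made up, and the explicit loop assumes a single group as in the example.

```r
df <- data.frame(
  group   = c("A", "A", "A", "A", "A"),
  start   = c(5, 0, 0, 0, 0),
  receipt = c(1, 5, 6, 4, 6),
  out     = c(4, 5, 3, 2, 5)
)

net <- df$start + df$receipt - df$out  # net change per row

# Explicit recurrence: each balance is the previous balance plus the net change
balance_loop <- numeric(nrow(df))
prev <- 0
for (i in seq_len(nrow(df))) {
  balance_loop[i] <- prev + net[i]
  prev <- balance_loop[i]
}

# The same thing as a running total, restarted for each group via ave()
df$balance <- ave(net, df$group, FUN = cumsum)
df
```

Both give the same column, which is why no lag(balance) is needed.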
I have data in this format:
How can I re-organize the data with R in the following format?
In other words: Create a new column for every single observation and paste a simple count if the observation occurs for the specific group.
This is most easily done using the tidyr package:
library(tidyr)
dat <- data.frame(letter = c("A", "A", "A", "A",
"B", "B", "B", "C",
"C", "C", "C", "D"),
number = c(2, 3, 4,5, 4, 5, 6, 1, 3, 5, 7, 1),
value = 1)
spread(dat, number, value)
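spread() leaves NA where a letter/number combination does not occur. If you would rather have 0 there, a sketch with the newer pivot_wider() (assuming a tidyr version that has values_fill and names_sort, i.e. 1.1.0 or later) could look like this; wide is just a name for the result.

```r
library(tidyr)

dat <- data.frame(
  letter = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C", "D"),
  number = c(2, 3, 4, 5, 4, 5, 6, 1, 3, 5, 7, 1),
  value  = 1
)

wide <- pivot_wider(
  dat,
  names_from  = number,
  values_from = value,
  values_fill = 0,     # 0 instead of NA where a letter/number pair is absent
  names_sort  = TRUE   # order the new columns by number
)
wide
```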
dat <- data.frame(letter = c("A", "A", "A", "A",
"B", "B", "B", "C",
"C", "C", "C", "D"),
number = c(2, 3, 4,5, 4, 5, 6, 1, 3, 5, 7, 1))
I would like to provide a base R solution (maybe just for fun...), based on matrix indexing.
lev <- unique(dat[[1L]]); k <- length(lev) ## unique levels
x <- dat[[2L]]; p <- max(x) ## column position
z <- matrix(0L, nrow = k, ncol = p, dimnames = list(lev, seq_len(p))) ## initialization
z[cbind(match(dat[[1L]], lev), dat[[2L]])] <- 1L ## replacement
z ## display
# 1 2 3 4 5 6 7
#A 0 1 1 1 1 0 0
#B 0 0 0 1 1 1 0
#C 1 0 1 0 1 0 1
#D 1 0 0 0 0 0 0
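For comparison, and only as a sketch: base R's table() cross-tabulates the two columns directly and gives the counts per letter/number pair. Unlike the matrix version above, it only creates columns for numbers that actually occur in the data, and a pair that occurs twice would be counted as 2 rather than 1.

```r
dat <- data.frame(
  letter = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C", "D"),
  number = c(2, 3, 4, 5, 4, 5, 6, 1, 3, 5, 7, 1)
)

# Rows are letters, columns are numbers, cells are counts
counts <- table(dat$letter, dat$number)
counts
```

as.data.frame.matrix(counts) turns the result into a regular data frame if that is what you need downstream.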