Flagging an id based on whether another column has different values within the group in R

I have a flagging rule I need to apply.
Here is what my dataset looks like:
df <- data.frame(id = c(1,1,1,1, 2,2,2,2, 3,3,3,3),
                 key = c("a","a","b","c", "a","b","c","d", "a","b","c","c"),
                 form = c("A","B","A","A", "A","A","A","A", "B","B","B","A"))
> df
id key form
1 1 a A
2 1 a B
3 1 b A
4 1 c A
5 2 a A
6 2 b A
7 2 c A
8 2 d A
9 3 a B
10 3 b B
11 3 c B
12 3 c A
I would like to flag ids based on a key column that has duplicates, where a third column, form, shows different forms for the same key. The idea is to understand whether an id has taken any items from multiple forms. I need to add a flagging column as below:
> df.1
id key form type
1 1 a A multiple
2 1 a B multiple
3 1 b A multiple
4 1 c A multiple
5 2 a A single
6 2 b A single
7 2 c A single
8 2 d A single
9 3 a B multiple
10 3 b B multiple
11 3 c B multiple
12 3 c A multiple
And eventually I need to get rid of the extra duplicated row that has a different form. To decide which of the duplicates to drop, I keep whichever form has more items.
In a final separate dataset, I would like to have something like below:
> df.2
id key form type
1 1 a A multiple
3 1 b A multiple
4 1 c A multiple
5 2 a A single
6 2 b A single
7 2 c A single
8 2 d A single
9 3 a B multiple
10 3 b B multiple
11 3 c B multiple
So the first id has form A dominant, so A is kept, and the third id has form B dominant, so B is kept.
Any ideas?
Thanks!

We can check the number of distinct elements per group to create the new column, and then filter based on the highest-frequency form (the mode):
library(dplyr)
df.2 <- df %>%
  group_by(id) %>%
  # flag ids whose rows span more than one form
  mutate(type = if (n_distinct(form) > 1) 'multiple' else 'single') %>%
  # keep only rows whose form is the id's most frequent one
  filter(form == Mode(form)) %>%
  ungroup()
Output:
> df.2
# A tibble: 10 × 4
id key form type
<dbl> <chr> <chr> <chr>
1 1 a A multiple
2 1 b A multiple
3 1 c A multiple
4 2 a A single
5 2 b A single
6 2 c A single
7 2 d A single
8 3 a B multiple
9 3 b B multiple
10 3 c B multiple
where
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
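For reference, a base R sketch of the same idea (a hedged alternative, reusing the Mode() helper above and the df from the question):
# Flag ids whose keys span more than one form, then keep only
# rows matching each id's most frequent form.
df$type <- with(df, ave(form, id, FUN = function(x)
  if (length(unique(x)) > 1) "multiple" else "single"))
df.2 <- do.call(rbind, lapply(split(df, df$id), function(g)
  g[g$form == Mode(g$form), ]))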

Related

R determine sequence with dplyr using group by

For the following data frame I would like to determine the sequence for column drug grouped by ID,
where the order should be based on column dat (the earliest date should be first in the sequence). In the initial df, some IDs have 1 row and some have more than 1 (in this case IDs 1 & 5).
df <- data.frame(ID = c(1,1,2,3,4,5,5,6,7,8),
                 dat = seq(as.Date("2021-01-01"), as.Date("2021-03-05"), by = "weeks"),
                 drug = c("A","A","B","C","B","B","C","D","C","B"))
The desired output should be
ID seq1
1 1 A,A
2 2 B
3 3 C
4 4 B
5 5 B,C
6 6 D
7 7 C
8 8 B
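No answer is attached to this question here, but a minimal dplyr sketch of one approach (assuming the df defined above) could be:
library(dplyr)
df %>%
  arrange(ID, dat) %>%                           # earliest date first within each ID
  group_by(ID) %>%
  summarise(seq1 = paste(drug, collapse = ","))  # e.g. "A,A" for ID 1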

count unique combinations of variable values in an R dataframe column [duplicate]

This question already has answers here:
Collapse / concatenate / aggregate a column to a single comma separated string within each group
(6 answers)
Count number of rows within each group
(17 answers)
I want to count the unique combinations of a variable that appear per group.
For example:
df <- data.frame(id = c(1,1,1,2,2,2,3,3,4,4,4,5,6,6,7,7,7),
                 status = c("a","b","c","a","b","c","b","c","b","c","d","b","b","c","b","c","d"))
> df
id status
1 1 a
2 1 b
3 1 c
4 2 a
5 2 b
6 2 c
7 3 b
8 3 c
9 4 b
10 4 c
11 4 d
12 5 b
13 6 b
14 6 c
15 7 b
16 7 c
17 7 d
So that, for example, I can tally how many times a given combination of "status" appears.
By hand, for example, I see that "a,b,c" appears twice in total (ids 1 and 2).
These seem to be similar questions, but I couldn't work out how to apply them here:
Counting unique combinations
Count of unique combinations despite order
The result I think I am looking for would be something like:
abc 2
bc 3
b 1
...
An option with tidyverse: group by 'id', paste the 'status' values together, and get the count:
library(dplyr)
library(stringr)
df %>%
  group_by(id) %>%
  summarise(status = str_c(status, collapse = "")) %>%
  count(status)
# A tibble: 4 x 2
# status n
# <chr> <int>
#1 abc 2
#2 b 1
#3 bc 2
#4 bcd 2
Here is a base R option via aggregate
> aggregate(. ~ status, rev(aggregate(. ~ id, df, paste0, collapse = "")), length)
status id
1 abc 2
2 b 1
3 bc 2
4 bcd 2
You can also get there with the apply family of functions, combining tapply and lapply with table:
tap <- tapply(df$status, df$id, FUN = function(x) unique(x))
lap <- lapply(tap, FUN = function(x) paste0(x, collapse = ""))
status <- unlist(lap)
df1 <- data.frame(table(status))
> df1
status Freq
1 abc 2
2 b 1
3 bc 2
4 bcd 2
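All three answers assume status is already sorted within each id. If, as the linked "despite order" question suggests, a combination should count regardless of order, a hedged tweak is to sort the unique values before pasting:
library(dplyr)
library(stringr)
df %>%
  group_by(id) %>%
  # sort() makes "c,b,a" and "a,b,c" count as the same combination
  summarise(status = str_c(sort(unique(status)), collapse = "")) %>%
  count(status)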

Long to short with data manipulation in R with 2 id pieces

In R, with a data set like the one below, I want to create a variable that is prior minus post. I'll need to do some calculations by ID and later by group, so I want to keep both columns.
Original
ID group time value
1 A prior 8
1 A post 5
2 A prior 4
2 A post 7
3 B prior 3
3 B post 10
4 B prior 5
4 B post 6
Desired data
ID group new_value
1 A -3
2 A 3
3 B 7
4 B 1
I think to get there I need to make my data like this
ID group value_prior value_post
1 A 8 5
2 A 4 7
3 B 3 10
4 B 5 6
But I'm not sure how to get there while preserving ID and group.
Assuming your data is already sorted, you could use:
aggregate(value ~ ID + group, df, diff)
ID group value
1 1 A -3
2 2 A 3
3 3 B 7
4 4 B 1
Or:
library(dplyr)
df %>%
  group_by(ID, group) %>%
  summarise(new_value = diff(value))
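If you do want the wide intermediate shape the question sketches, one hedged tidyr approach (assuming the same df) is:
library(tidyr)
library(dplyr)
df %>%
  pivot_wider(names_from = time, values_from = value,
              names_prefix = "value_") %>%       # -> value_prior, value_post
  mutate(new_value = value_post - value_prior)   # matches the desired output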

reshaping data with time represented as spells

I have a dataset in which time is represented as spells (i.e. from time 1 to time 2), like this:
d <- data.frame(id = c("A","A","B","B","C","C"),
                t1 = c(1,3,1,3,1,3),
                t2 = c(2,4,2,4,2,4),
                value = 1:6)
I want to reshape this into a panel dataset, i.e. one row for each unit and time period, like this:
result <- data.frame(id = c("A","A","A","A","B","B","B","B","C","C","C","C"),
                     t = c(1:4, 1:4, 1:4),
                     value = c(1,1,2,2,3,3,4,4,5,5,6,6))
I am attempting to do this with tidyr and gather but am not getting the desired result. I am trying something like this, which is clearly wrong:
gather(d, 't1', 't2', key=t)
In the actual dataset the spells are irregular.
You were almost there.
Code
d %>%
  # Gather the needed variables. Explanation:
  #   t_type: the new column that will hold the former
  #           variable names (t1, t2).
  #   t:      the new column that will hold the values of
  #           those variables.
  #   -id, -value: columns that should stay the same and NOT be
  #           gathered under t_type (key) and t (value).
  gather(t_type, t, -id, -value) %>%
  # Select the right columns in the right order.
  # Watch out: we did not select t_type, so it gets dropped.
  select(id, t, value) %>%
  # Arrange / sort the data by the following columns.
  # For a descending order, put a "-" in front of the column name.
  arrange(id, t)
Result
id t value
1 A 1 1
2 A 2 1
3 A 3 2
4 A 4 2
5 B 1 3
6 B 2 3
7 B 3 4
8 B 4 4
9 C 1 5
10 C 2 5
11 C 3 6
12 C 4 6
So, the goal is to melt the t1 and t2 columns and to drop the key column that appears as a result. There are a couple of options. Base R's reshape seems tedious here. We may, however, use melt:
library(reshape2)
melt(d, measure.vars = c("t1", "t2"), value.name = "t")[-3]
# id value t
# 1 A 1 1
# 2 A 2 3
# 3 B 3 1
# 4 B 4 3
# 5 C 5 1
# 6 C 6 3
# 7 A 1 2
# 8 A 2 4
# 9 B 3 2
# 10 B 4 4
# 11 C 5 2
# 12 C 6 4
where [-3] drops the key column. We may also use gather, as in
gather(d, "key", "t", t1, t2)[-3]
# id value t
# 1 A 1 1
# 2 A 2 3
# 3 B 3 1
# 4 B 4 3
# 5 C 5 1
# 6 C 6 3
# 7 A 1 2
# 8 A 2 4
# 9 B 3 2
# 10 B 4 4
# 11 C 5 2
# 12 C 6 4
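Since gather() has been superseded in current tidyr, a hedged pivot_longer() equivalent of the same reshape is:
library(tidyr)
library(dplyr)
d %>%
  pivot_longer(c(t1, t2), values_to = "t") %>%  # the default "name" column replaces the key
  select(id, t, value) %>%                      # dropping it, as above
  arrange(id, t)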

Replace values in a series exceeding a threshold

In a dataframe I'd like to replace values in a series where they exceed a given threshold.
For example, within a group ('ID') in a series designated by 'time', if 'value' ever exceeds 3, I'd like to make all following entries also equal 3.
ID <- as.factor(c(rep("A", 3), rep("B",3), rep("C",3)))
time <- rep(1:3, 3)
value <- c(c(1,1,2), c(2,3,2), c(3,3,2))
dat <- cbind.data.frame(ID, time, value)
dat
ID time value
A 1 1
A 2 1
A 3 2
B 1 2
B 2 3
B 3 2
C 1 3
C 2 3
C 3 2
I'd like it to be:
ID time value
A 1 1
A 2 1
A 3 2
B 1 2
B 2 3
B 3 3
C 1 3
C 2 3
C 3 3
This should be easy, but I can't figure it out. Thanks!
The ave function makes this very easy by allowing you to apply a function to each of the groupings. In this case, we adapt cummax (cumulative maximum) to check whether we've already seen a 3:
dat$value2 <- with(dat, ave(value, ID, FUN =
  function(x) ifelse(cummax(x) >= 3, 3, x)))
dat
# ID time value value2
# 1 A 1 1 1
# 2 A 2 1 1
# 3 A 3 2 2
# 4 B 1 2 2
# 5 B 2 3 3
# 6 B 3 2 3
# 7 C 1 3 3
# 8 C 2 3 3
# 9 C 3 2 3
You could also just use FUN = cummax if you want never-decreasing values. I wasn't sure whether a sequence like c(1, 2, 1) should be kept unchanged or not.
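For completeness, a hedged dplyr equivalent of the same cummax idea:
library(dplyr)
dat %>%
  group_by(ID) %>%
  mutate(value2 = ifelse(cummax(value) >= 3, 3, value)) %>%
  ungroup()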
If you can assume your data are sorted by group, then this should be fast, essentially relying on findInterval() behind the scenes:
library(IRanges)
id <- Rle(ID)
three <- which(value >= 3L)
ir <- reduce(IRanges(three, end(id)[findRun(three, id)]))
dat$value[as.integer(ir)] <- 3L
This avoids looping over the groups.
