R: determine sequence with dplyr using group_by

For the following data frame, I would like to determine the sequence for column drug, grouped by ID,
where the order should be based on column dat (the earliest date should be first in the sequence). In the initial df, some IDs have one row and some have more than one (here, IDs 1 and 5).
df <- data.frame(ID = c(1,1,2,3,4,5,5,6,7,8),
                 dat = seq(as.Date("2021-01-01"), as.Date("2021-03-05"), by = "weeks"),
                 drug = c("A","A","B","C","B","B","C","D","C","B"))
The desired output should be
ID seq1
1 1 A,A
2 2 B
3 3 C
4 4 B
5 5 B,C
6 6 D
7 7 C
8 8 B
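One way to get there with dplyr is to sort by dat within each ID and then collapse drug into a comma-separated string; a minimal sketch, assuming the df defined above:
library(dplyr)

df %>%
  arrange(ID, dat) %>%                  # earliest date first within each ID
  group_by(ID) %>%
  summarise(seq1 = paste(drug, collapse = ","))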

Flagging an id based on another column has different values in R

I have a flagging rule I need to apply.
Here is what my dataset looks like:
df <- data.frame(id = c(1,1,1,1, 2,2,2,2, 3,3,3,3),
                 key = c("a","a","b","c", "a","b","c","d", "a","b","c","c"),
                 form = c("A","B","A","A", "A","A","A","A", "B","B","B","A"))
> df
id key form
1 1 a A
2 1 a B
3 1 b A
4 1 c A
5 2 a A
6 2 b A
7 2 c A
8 2 d A
9 3 a B
10 3 b B
11 3 c B
12 3 c A
I would like to flag ids based on the key column having duplicates where the third column, form, shows a different form for the same key. The idea is to understand whether an id has taken any items from multiple forms. I need to add a flagging column as below:
> df.1
id key form type
1 1 a A multiple
2 1 a B multiple
3 1 b A multiple
4 1 c A multiple
5 2 a A single
6 2 b A single
7 2 c A single
8 2 d A single
9 3 a B multiple
10 3 b B multiple
11 3 c B multiple
12 3 c A multiple
Eventually I also need to get rid of the extra duplicated row that has a different form. To decide which of the duplicates to drop, I keep whichever form has more items.
In a final, separate dataset, I would like to have something like below:
> df.2
id key form type
1 1 a A multiple
3 1 b A multiple
4 1 c A multiple
5 2 a A single
6 2 b A single
7 2 c A single
8 2 d A single
9 3 a B multiple
10 3 b B multiple
11 3 c B multiple
So the first id has form A dominant, so A is kept, and the third id has form B dominant, so B is kept.
Any ideas?
Thanks!
We can check the number of distinct elements per group to create the new column, and then filter on the most frequent form (the Mode):
library(dplyr)
df.2 <- df %>%
  group_by(id) %>%
  mutate(type = if (n_distinct(form) > 1) 'multiple' else 'single') %>%
  filter(form == Mode(form)) %>%
  ungroup()
Output:
> df.2
# A tibble: 10 × 4
id key form type
<dbl> <chr> <chr> <chr>
1 1 a A multiple
2 1 b A multiple
3 1 c A multiple
4 2 a A single
5 2 b A single
6 2 c A single
7 2 d A single
8 3 a B multiple
9 3 b B multiple
10 3 c B multiple
where
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
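If only the flagged data (df.1 above) is needed, without dropping any rows, the same mutate step can stand on its own; a minimal sketch:
df.1 <- df %>%
  group_by(id) %>%
  mutate(type = if (n_distinct(form) > 1) 'multiple' else 'single') %>%
  ungroup()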

count unique combinations of variable values in an R dataframe column [duplicate]

I want to count the unique combinations of a variable that appear per group.
For example:
df <- data.frame(id = c(1,1,1,2,2,2,3,3,4,4,4,5,6,6,7,7,7),
                 status = c("a","b","c","a","b","c","b","c","b","c","d","b","b","c","b","c","d"))
> df
id status
1 1 a
2 1 b
3 1 c
4 2 a
5 2 b
6 2 c
7 3 b
8 3 c
9 4 b
10 4 c
11 4 d
12 5 b
13 6 b
14 6 c
15 7 b
16 7 c
17 7 d
So that, for example, I can tally how many times a given combination of "status" appears.
By hand, for example, I can see that "a,b,c" appears twice in total (ids 1 and 2).
These seem to be similar questions, but I couldn't work out how to apply them here, and I'd like a clearer explanation in R:
Counting unique combinations
Count of unique combinations despite order
The result I think I am looking for would be something like:
abc 2
bc 3
b 1
...
An option with the tidyverse: group by 'id', paste the 'status' values together, and count the result.
library(dplyr)
library(stringr)
df %>%
  group_by(id) %>%
  summarise(status = str_c(status, collapse = "")) %>%
  count(status)
# A tibble: 4 x 2
# status n
# <chr> <int>
#1 abc 2
#2 b 1
#3 bc 2
#4 bcd 2
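To match the comma-separated style shown in the question ("a,b,c"), the same pipe only needs a separator in the collapse; a minimal sketch:
df %>%
  group_by(id) %>%
  summarise(status = str_c(status, collapse = ",")) %>%
  count(status)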
Here is a base R option via aggregate:
> aggregate(. ~ status, rev(aggregate(. ~ id, df, paste0, collapse = "")), length)
status id
1 abc 2
2 b 1
3 bc 2
4 bcd 2
You can also use the apply family of functions, combining tapply and lapply to get there with table.
tap <- tapply(df$status, df$id, FUN = function(x) unique(x))
lap <- lapply(tap, FUN = function(x) paste0(x, collapse = ""))
status <- unlist(lap)
df1 <- data.frame(table(status))
> df1
status Freq
1 abc 2
2 b 1
3 bc 2
4 bcd 2

R: Filtering by two columns using "is not equal" operator dplyr/subset

This question must have been answered before, but I cannot find it anywhere. I need to filter/subset a data frame using values in two columns to remove rows. In the example, I want to keep all the rows that are not equal (!=) to both replicate 1 and treatment "a". However, both subset and filter remove all of replicate 1 and all of treatment "a". I could solve it with which and then indexing, but that does not play well with the pipe operator. Do you know why filter/subset do not drop rows only when both conditions are true?
require(dplyr)
#Create example dataframe
replicate = rep(c(1:3), times = 4)
treatment = rep(c("a","b"), each = 6)
df = data.frame(replicate, treatment)
#filtering data
> filter(df, replicate!=1, treatment!="a")
replicate treatment
1 2 b
2 3 b
3 2 b
4 3 b
> subset(df, (replicate!=1 & treatment!="a"))
replicate treatment
8 2 b
9 3 b
11 2 b
12 3 b
#solution by which - indexing
index = which(df$replicate==1 & df$treatment=="a")
> df[-index,]
replicate treatment
2 2 a
3 3 a
5 2 a
6 3 a
7 1 b
8 2 b
9 3 b
10 1 b
11 2 b
12 3 b
I think you're looking to use an "or" condition here. How does this look:
require(dplyr)
#Create example dataframe
replicate = rep(c(1:3), times = 4)
treatment = rep(c("a","b"), each = 6)
df = data.frame(replicate, treatment)
df %>%
  filter(replicate != 1 | treatment != "a")
replicate treatment
1 2 a
2 3 a
3 2 a
4 3 a
5 1 b
6 2 b
7 3 b
8 1 b
9 2 b
10 3 b
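The reason the original attempts drop too much is that comma-separated conditions in filter (and the & in subset) are combined with AND, so a row is kept only when it is neither replicate 1 nor treatment "a", which removes every row matching either value. By De Morgan's law, the "or" of the negations above is equivalent to negating the combined condition, which may read closer to the original intent; a minimal sketch:
df %>%
  filter(!(replicate == 1 & treatment == "a"))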

R Sum columns by index

I need to find a way to sum columns by their index. I'm working on a big file read with read.csv, so I'll show a sample of the problem here; I'd like, for example, to sum from the 2nd to the 5th column and from the 6th to the 7th column of the following matrix:
a 1 3 3 4 5 6
b 2 1 4 3 4 1
c 1 3 2 1 1 5
d 2 2 4 3 1 3
The result has to be like this:
a 11 11
b 10 5
c 7 6
d 8 4
The columns all have different names.
We can use rowSums on subsets of the columns, i.e. 2:5 and 6:7 separately, and then create a new data.frame with the output.
data.frame(df1[1], Sum1=rowSums(df1[2:5]), Sum2=rowSums(df1[6:7]))
# id Sum1 Sum2
#1 a 11 11
#2 b 10 5
#3 c 7 6
#4 d 11 4
The package dplyr has a function made exactly for that purpose:
require(dplyr)
df1 <- data.frame(a = c(1,2,3,4,3,3), b = c(1,2,3,2,1,2), c = c(1,2,3,21,2,3))
# by column name
df2 <- df1 %>% transmute(sum1 = a + b, sum2 = b + c)
# the same, selecting the columns by position
df2 <- df1 %>% transmute(sum1 = .[[1]] + .[[2]], sum2 = .[[2]] + .[[3]])
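With dplyr 1.0 or later (an assumption about your installed version), the positional selection can also be written with across(), which scales more naturally to wider index ranges like the 2:5 and 6:7 in the question; a minimal sketch on the small df1 above:
df2 <- df1 %>% transmute(sum1 = rowSums(across(1:2)), sum2 = rowSums(across(2:3)))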

Order multiple rows in a data frame

I have a data frame like this
ID EPOCH
B 2
B 3
A 1
A 2
A 3
C 0
and what I would like to do is order it by each ID's first appearance (i.e. the minimum value of EPOCH for each ID), so that I get
ID EPOCH
C 0
A 1
A 2
A 3
B 2
B 3
I only managed to order the data frame by EPOCH and then ID,
df[order(df$EPOCH, df$ID), ]
but then it is no longer clustered by ID, i.e.
C 0
A 1
A 2
B 2
A 3
B 3
Many thanks
First add a column with the minimum EPOCH for each ID to the data.frame:
data <- read.table(textConnection("ID EPOCH
B 2
B 3
A 1
A 2
A 3
C 0"), header=TRUE)
a <- aggregate(data$EPOCH, data["ID"], min)
names(a)[2] <- "min_EPOCH"
data <- merge(data, a)
Then sort on that new column:
o <- order(data$min_EPOCH, data$ID, data$EPOCH)
data[o, ]
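The same idea in dplyr, computing the per-ID minimum and then arranging on it; a minimal sketch:
library(dplyr)

data %>%
  group_by(ID) %>%
  mutate(min_EPOCH = min(EPOCH)) %>%   # first-appearance EPOCH per ID
  ungroup() %>%
  arrange(min_EPOCH, ID, EPOCH) %>%
  select(-min_EPOCH)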
