Conditional cumulative mean for each group in R

I have a data set that looks like this:
id  a  b
 1 AA  2
 1 AB  5
 1 AA  1
 2 AB  2
 2 AB  4
 3 AB  4
 3 AB  3
 3 AA  1
 3 AA  4
I need to calculate the cumulative mean of b for each record within each group, excluding cases where a == 'AA'. So the sample output should be:
id  a  b  mean
 1 AA  2  -
 1 AB  5  5
 1 AA  1  5
 2 AB  2  2
 2 AB  4  (4+2)/2
 3 AB  4  4
 3 AB  3  (4+3)/2
 3 AA  1  (4+3)/2
 3 AA  4  (4+3)/2
I tried to achieve it using dplyr and cummean but am getting an error.
df <- df %>%
  group_by(id) %>%
  mutate(mean = cummean(b[a != 'AA']))

Error: incompatible size (123), expecting 147 (the group size) or 1
Can you suggest a better way to achieve this in R?

The trick here is to reconstruct the cummean by dividing the adjusted cumulative sum by the adjusted count. As a one-liner:
df %>% group_by(id) %>% mutate(mean = cumsum(b * (a != 'AA')) / cumsum(a != 'AA'))
We can make this a little nicer (the "multiply by a != 'AA'" trick is the ugly part, to my mind) by pulling a != 'AA' out into its own column:
df %>%
  group_by(id) %>%
  mutate(relevance = 0 + (a != 'AA'),
         mean = cumsum(relevance * b) / cumsum(relevance))

There may be an easier approach. Here, we group by 'id' and create a new column 'Mean' by first converting the elements of 'b' that correspond to 'AA' in 'a' to NA (b*NA^(a=='AA')). NA^(a=='AA') gives NA for 'AA' in 'a' and 1 for all other values, so multiplying by 'b' keeps the original values where a != 'AA' while the NAs remain. We use na.aggregate to replace each NA with the mean of the non-NA elements in its group, then wrap with cummean to get the cumulative mean. If the first value of 'a' in a group is 'AA', we can force an NA there by multiplying by NA^(row_number()==1 & a=='AA'). (Note that filling with the whole-group mean leaves the running mean unchanged only when the 'AA' rows trail the values they average over, as in this data; 'AA' rows interleaved earlier in a group would distort the cumulative mean.)
library(zoo)
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(Mean = cummean(na.aggregate(b * NA^(a == 'AA'))) *
           NA^(row_number() == 1 & a == 'AA'))
# Source: local data frame [9 x 4]
#Groups: id [3]
# id a b Mean
# (int) (chr) (int) (dbl)
#1 1 AA 2 NA
#2 1 AB 5 5.0
#3 1 AA 1 5.0
#4 2 AB 2 2.0
#5 2 AB 4 3.0
#6 3 AB 4 4.0
#7 3 AB 3 3.5
#8 3 AA 1 3.5
#9 3 AA 4 3.5
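As a quick illustration of the NA^ trick, applied to the first group's values:
a <- c("AA", "AB", "AA")
b <- c(2, 5, 1)
NA^(a == "AA")       # NA  1 NA
b * NA^(a == "AA")   # NA  5 NA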
data
df <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
a = c("AA",
"AB", "AA", "AB", "AB", "AB", "AB", "AA", "AA"), b = c(2L, 5L,
1L, 2L, 4L, 4L, 3L, 1L, 4L)), .Names = c("id", "a", "b"),
class = "data.frame", row.names = c(NA, -9L))

Related

Aggregating rows across multiple values

I have a large dataframe with approximately this pattern:
Person Rate Street    a     b     c     d     e     f
A      2    XYZ       1     NULL  3     4     5     NULL
A      2    XYZ       NULL  2     NULL  NULL  NULL  NULL
A      3    XYZ       NULL  NULL  NULL  NULL  NULL  6
B      2    DEF       NULL  NULL  NULL  NULL  5     NULL
B      2    DEF       NULL  2     3     NULL  NULL  6
C      1    DEF       1     2     3     4     5     6
Columns a, b, c, d, e, f stand in for about 600 columns.
I am trying to combine the rows so that each person becomes one line, columns a-f are combined using sum, and any conflicting rate or street information becomes a new column. So the data should look something like this:
Person Rate Rate 2 Street    a     b  c  d     e  f
A      2    3      XYZ       1     2  3  4     5  6
B      2           DEF       NULL  2  3  NULL  5  6
C      1           DEF       1     2  3  4     5  6
I keep trying to make this work with aggregate and summarize but I'm not sure that's the right approach.
Thank you very much for your help!
First we pivot all the unique rates per person and street.
library(reshape2)
tmp1 <- dcast(unique(df[, c("Person", "Rate", "Street")]),
              Person + Street ~ Rate, value.var = "Rate")
colnames(tmp1)[-(1:2)] <- paste("Rate", colnames(tmp1)[-(1:2)])
Then we aggregate and sum by person and street, over columns 4 to 9, from "a" to "f"; change accordingly.
tmp2 <- aggregate(df[, 4:9], list(Person = df$Person, Street = df$Street),
                  function(x) {
                    ifelse(all(is.na(x)), NA, sum(x, na.rm = TRUE))
                  })
And finally merge the two.
merge(tmp1,tmp2,by=c("Person","Street"))
Person Street Rate 1 Rate 2 Rate 3 a b c d e f
1 A XYZ NA 2 3 1 2 3 4 5 6
2 B DEF NA 2 NA NA 2 3 NA 5 6
3 C DEF 1 NA NA 1 2 3 4 5 6
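With ~600 value columns, hard-coding 4:9 is brittle. A sketch that derives the value columns by name instead (assuming the same df and that Person, Rate, Street are the only non-value columns):
value_cols <- setdiff(names(df), c("Person", "Rate", "Street"))
tmp2 <- aggregate(df[, value_cols],
                  list(Person = df$Person, Street = df$Street),
                  function(x) ifelse(all(is.na(x)), NA, sum(x, na.rm = TRUE)))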
Perhaps you can do this as a two-step process:
library(dplyr)
library(tidyr)
# sum columns a-f
table1 <- df %>%
  group_by(Person) %>%
  summarise(across(a:f, sum, na.rm = TRUE))
# Remove duplicated values and get the data in separate columns
# for the Rate and Street columns.
table2 <- df %>%
  group_by(Person) %>%
  mutate(across(c(Rate, Street), ~replace(., duplicated(.), NA))) %>%
  select(Person, Rate, Street) %>%
  filter(if_any(c(Rate, Street), ~!is.na(.))) %>%
  mutate(col = row_number()) %>%
  ungroup %>%
  pivot_wider(names_from = col, values_from = c(Rate, Street)) %>%
  select(where(~any(!is.na(.))))
# Join the two tables to get the final result
inner_join(table1, table2, by = 'Person')
# Person a b c d e f Rate_1 Rate_2 Street_1
# <chr> <int> <int> <int> <int> <int> <int> <int> <int> <chr>
#1 A 1 2 3 4 5 6 2 3 XYZ
#2 B 0 2 3 0 5 6 2 NA DEF
#3 C 1 2 3 4 5 6 1 NA DEF
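A side note: newer dplyr versions (1.1+) deprecate passing na.rm through across()'s ..., so if you see a warning, the anonymous-function form does the same thing (the \(x) shorthand needs R >= 4.1; use function(x) otherwise):
table1 <- df %>%
  group_by(Person) %>%
  summarise(across(a:f, \(x) sum(x, na.rm = TRUE)))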
data
It is easier to help when you share data in a reproducible format that can be copied directly. I have used the data below for this answer.
df <- structure(list(Person = c("A", "A", "A", "B", "B", "C"), Rate = c(2L,
2L, 3L, 2L, 2L, 1L), Street = c("XYZ", "XYZ", "XYZ", "DEF", "DEF",
"DEF"), a = c(1L, NA, NA, NA, NA, 1L), b = c(NA, 2L, NA, NA,
2L, 2L), c = c(3L, NA, NA, NA, 3L, 3L), d = c(4L, NA, NA, NA,
NA, 4L), e = c(5L, NA, NA, 5L, NA, 5L), f = c(NA, NA, 6L, NA,
6L, 6L)), row.names = c(NA, -6L), class = "data.frame")

Remove a group from data frame if all values across are a certain number? [duplicate]

This question already has answers here:
dplyr filter columns with value 0 for all rows with unique combinations of other columns
I have a data frame where I'd like to remove entire groups if their Status value is the same across all 6 time points.
Patients Time Status
1        a    5
1        b    5
1        c    5
1        d    5
1        e    5
1        f    5
2        a    4
2        b    4
2        c    5
2        d    5
2        e    5
2        f    5
Basically, I'd like to remove all patients from this data frame who have a status of "5" at ALL time points. If a patient has any value apart from 5 at any point in time I'd like to include them.
I tried
df <- df %>%
  filter(a != 5 & b != 5 & c != 5 & d != 5 & e != 5 & f != 5)
To no avail, unfortunately. Would appreciate any help. Thank you!
You can use any/all:
library(dplyr)
df %>% group_by(Patients) %>% filter(any(Status != 5))
#With `all`
#df %>% group_by(Patients) %>% filter(!all(Status == 5))
# Patients Time Status
# <int> <chr> <int>
#1 2 a 4
#2 2 b 4
#3 2 c 5
#4 2 d 5
#5 2 e 5
#6 2 f 5
This can also be written in base R:
subset(df, ave(Status != 5, Patients, FUN = any))
#and `data.table` :
library(data.table)
setDT(df)[, .SD[any(Status != 5)], Patients]
Without grouping by Patients you can do :
subset(df, Patients %in% unique(Patients[Status != 5]))
data
df <- structure(list(Patients = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L), Time = c("a", "b", "c", "d", "e", "f", "a", "b",
"c", "d", "e", "f"), Status = c(5L, 5L, 5L, 5L, 5L, 5L, 4L, 4L,
5L, 5L, 5L, 5L)), row.names = c(NA, -12L), class = "data.frame")
Something like this?
df <- data.frame(
patients = c(rep(1,6),rep(2,6)),
time = rep(letters[1:6],2),
status = c(rep(5,6),rep(4,2),rep(5,4))
)
df %>%
  group_by(patients) %>%
  dplyr::filter(status * 6 != sum(status))
(Note this assumes exactly six rows per patient; a mix of statuses that happens to sum to the same total would also be dropped, so the any/all approach above is more robust.)
If I understood your problem correctly, one of these solutions should help:
library(dplyr)
library(data.table)
# your test data
df <- data.table::fread("Patients Time Status
1 a 5
1 b 5
1 c 5
1 d 5
1 e 5
1 f 5
2 a 4
2 b 4
2 c 5
2 d 5
2 e 5
2 f 5")
# one option to get all rows with Status different from 5
df %>%
  # exclude everything where Status is 5
  dplyr::filter(Status != 5)
Patients Time Status
1: 2 a 4
2: 2 b 4
# one option to get all distinct patients
df %>%
  # exclude everything where Status is 5
  dplyr::filter(Status != 5) %>%
  # unique values per column or column combination
  dplyr::distinct(Patients)
Patients
1: 2
# one option to get all data of patients with at least one Status != 5
df %>%
  # exclude everything where Status is 5
  dplyr::filter(Status != 5) %>%
  # unique values per column or column combination
  dplyr::distinct(Patients) %>%
  # join back on original data to get all values for specific patients
  dplyr::inner_join(df, by = "Patients")
Patients Time Status
1: 2 a 4
2: 2 b 4
3: 2 c 5
4: 2 d 5
5: 2 e 5
6: 2 f 5
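The filter-distinct-join chain above can also be collapsed with a semi join, which keeps the rows of df whose Patients value has a match in the filtered set; a sketch assuming the same df:
df %>%
  dplyr::semi_join(dplyr::filter(df, Status != 5), by = "Patients")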

Count distinct in R group by, first splitting the cells by ","?

I have data in the format given below:
a  b
1  A,B
1  A
1  B
2  A,B
2  D,C
2  A
2  A
What I need is, when grouping by column 'a', the count of distinct values in column 'b':
a  count
1  2
2  4
Because for 1 we only have 2 distinct values, i.e. A and B, but for 2 we have 4, i.e. A, B, C, D.
I could first explode the data into a tall format and then do the group by, but since I have a few other aggregations to do, I was wondering if there is a way to do it in one line.
Thanks in advance
We can use aggregate in base R:
aggregate(b ~ a, df, function(x) length(unique(unlist(strsplit(x, ',')))))
# a b
#1 1 2
#2 2 4
data
df <- structure(list(a = c(1L, 1L, 1L, 2L, 2L, 2L, 2L), b = c("A,B",
"A", "B", "A,B", "D,C", "A", "A")), class = "data.frame", row.names = c(NA, -7L))
Using tidyr::separate_rows and dplyr::n_distinct this could be achieved like so:
library(dplyr)
d %>%
  tidyr::separate_rows(b) %>%
  group_by(a) %>%
  summarise(count = n_distinct(b))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 2
#> a count
#> <int> <int>
#> 1 1 2
#> 2 2 4
DATA
d <- read.table(text = "a b
1 A,B
1 A
1 B
2 A,B
2 D,C
2 A
2 A", header = TRUE)
Base R using Map():
do.call(c, Map(function(x) length(unique(trimws(unlist(strsplit(x, ","))))),
               with(df, split(b, a))))
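For completeness, a data.table sketch of the same computation, assuming the df from the base R answer (uniqueN counts distinct values):
library(data.table)
setDT(df)[, .(count = uniqueN(unlist(strsplit(b, ",")))), by = a]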

Finding minimum by groups and among columns

I am trying to find the minimum value among different columns and group.
A small sample of my data looks something like this:
group cut group_score_1 group_score_2
1 a 1 3 5.0
2 b 2 2 4.0
3 a 0 2 2.5
4 b 3 5 4.0
5 a 2 3 6.0
6 b 1 5 1.0
I want to group by the groups and for each group, find the row which contains the minimum group score among both group scores and then also get the name of the column which contains the minimum (group_score_1 or group_score_2),
so basically my result should be something like this:
group cut group_score_1 group_score_2
1 a 0 2 2.5
2 b 1 5 1.0
I tried a few ideas and eventually ended up splitting the data into several new data frames, filtering by group, selecting the relevant columns, and then using which.min(), but I'm sure there's a much more efficient way to do it. Not sure what I am missing.
We can use data.table methods
library(data.table)
setDT(df)[df[, .I[which.min(do.call(pmin, .SD))],
group, .SDcols = patterns('^group_score')]$V1]
# group cut group_score_1 group_score_2
#1: a 0 2 2.5
#2: b 1 5 1.0
For each group, you can calculate the minimum value and select the row where that value exists in one of the columns.
library(dplyr)
df %>%
  group_by(group) %>%
  filter({tmp = min(group_score_1, group_score_2);
          group_score_1 == tmp | group_score_2 == tmp})
# group cut group_score_1 group_score_2
# <chr> <int> <int> <dbl>
#1 a 0 2 2.5
#2 b 1 5 1
The above works well when you have only two group_score columns. If you have many such columns, it is not practical to list each of them as group_score_1 == tmp | group_score_2 == tmp, etc. In that case, get the data in long format, find the cut value corresponding to the minimum, and join back to the data. This assumes cut is unique within each group.
df %>%
  tidyr::pivot_longer(cols = starts_with('group_score')) %>%
  group_by(group) %>%
  summarise(cut = cut[which.min(value)]) %>%
  left_join(df, by = c("group", "cut"))
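Alternatively, a join-free sketch that handles any number of score columns and does not require cut to be unique (pick() needs dplyr >= 1.1; the helper name row_min is just illustrative):
df %>%
  group_by(group) %>%
  filter({
    row_min <- do.call(pmin, pick(starts_with("group_score")))  # row-wise min across score columns
    row_min == min(row_min)                                     # keep rows hitting the group minimum
  })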
Here is a base R option using pmin + ave + subset
subset(
df,
as.logical(ave(
do.call(pmin, df[grep("group_score_\\d+", names(df))]),
group,
FUN = function(x) x == min(x)
))
)
which gives
group cut group_score_1 group_score_2
3 a 0 2 2.5
6 b 1 5 1.0
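To see what do.call(pmin, ...) computes above: it takes the row-wise minimum across the score columns. With the values from the data below:
do.call(pmin, list(c(3, 2, 2, 5, 3, 5), c(5, 4, 2.5, 4, 6, 1)))
# [1] 3.0 2.0 2.0 4.0 3.0 1.0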
Data
> dput(df)
structure(list(group = c("a", "b", "a", "b", "a", "b"), cut = c(1L,
2L, 0L, 3L, 2L, 1L), group_score_1 = c(3L, 2L, 2L, 5L, 3L, 5L
), group_score_2 = c(5, 4, 2.5, 4, 6, 1)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))

R find consecutive months

I'd like to find consecutive months by client. I thought this would be easy, but I still can't find a solution.
My goal is to number consecutive months of purchases for each client. Any help is appreciated.
My data
Client Month consecutive
A 1 1
A 1 2
A 2 3
A 5 1
A 6 2
A 8 1
B 8 1
In base R, we can use ave
df$consecutive <- with(df, ave(Month, Client, cumsum(c(TRUE, diff(Month) > 1)),
                               FUN = seq_along))
df
# Client Month consecutive
#1 A 1 1
#2 A 1 2
#3 A 2 3
#4 A 5 1
#5 A 6 2
#6 A 8 1
#7 B 8 1
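To see why the cumsum expression marks the runs, here it is applied to Client A's months from the data:
Month <- c(1, 1, 2, 5, 6, 8)
diff(Month) > 1                   # FALSE FALSE TRUE FALSE TRUE
cumsum(c(TRUE, diff(Month) > 1))  # 1 1 1 2 2 3  -> three runs: (1,1,2), (5,6), (8)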
In dplyr, we can create a new group with lag to compare the current month with the previous month and assign row_number() in each group.
library(dplyr)
df %>%
  group_by(Client, group = cumsum(Month - lag(Month, default = first(Month)) > 1)) %>%
  mutate(consecutive = row_number()) %>%
  ungroup %>%
  select(-group)
We can create a grouping variable based on the difference in adjacent 'Month' for each 'Client' and use that to create the sequence
library(dplyr)
df1 %>%
  group_by(Client) %>%
  group_by(grp = cumsum(c(TRUE, diff(Month) > 1)), add = TRUE) %>%
  mutate(consec = row_number()) %>%
  ungroup %>%
  select(-grp)
# A tibble: 7 x 4
# Client Month consecutive consec
# <chr> <int> <int> <int>
#1 A 1 1 1
#2 A 1 2 2
#3 A 2 3 3
#4 A 5 1 1
#5 A 6 2 2
#6 A 8 1 1
#7 B 8 1 1
Or using data.table
library(data.table)
setDT(df1)[, grp := cumsum(c(TRUE, diff(Month) > 1)), Client
][, consec := seq_len(.N), .(Client, grp)
][, grp := NULL][]
data
df1 <- structure(list(Client = c("A", "A", "A", "A", "A", "A", "B"),
Month = c(1L, 1L, 2L, 5L, 6L, 8L, 8L), consecutive = c(1L,
2L, 3L, 1L, 2L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-7L))
