subsetting duplicates per individual

subsetting duplicates per individual - r

dfin <-
STUDY ID CYCLE TIME VALUE
1 1 0 10 50
1 1 0 20 20
1 2 1 20 20
Per study and ID, for those who have duplicate CYCLE == 0 values, remove the row that had the higher TIME.
dfout <-
STUDY ID CYCLE TIME VALUE
1 1 0 10 50
1 2 1 20 20
Using RStudio.

An option is to do a group by 'STUDY', 'ID' and filter out the duplicated 0 values in 'CYCLE'
library(dplyr)
dfin %>%
arrange(STUDY, ID, TIME) %>%
group_by(STUDY, ID) %>%
filter(!(duplicated(CYCLE) & CYCLE == 0))
# A tibble: 2 x 5
# Groups: STUDY, ID [2]
# STUDY ID CYCLE TIME VALUE
# <int> <int> <int> <int> <int>
#1 1 1 0 10 50
#2 1 2 1 20 20
Also, if there are many duplicates for 0 and want to remove only the row where 'TIME' is also max
dfin %>%
group_by(STUDY, ID) %>%
filter(!(TIME == max(TIME) & CYCLE == 0))
Or using base R
dfin1 <- do.call(order, dfin[c("STUDY", "ID", "TIME")])
dfin1[!(duplicated(dfin1[1:3]) & duplicated(dfin1$CYCLE)),]
# STUDY ID CYCLE TIME VALUE
#1 1 1 0 10 50
#3 1 2 1 20 20
data
dfin <- structure(list(STUDY = c(1L, 1L, 1L), ID = c(1L, 1L, 2L), CYCLE = c(0L,
0L, 1L), TIME = c(10L, 20L, 20L), VALUE = c(50L, 20L, 20L)),
class = "data.frame", row.names = c(NA,
-3L))

Related

Identify unique values within a multivariable subset

I have data that look like these:
Subject Site Date
1 2 '2020-01-01'
1 2 '2020-01-01'
1 2 '2020-01-02'
2 1 '2020-01-02'
2 1 '2020-01-03'
2 1 '2020-01-03'
And I'd like to create an order variable for unique dates by Subject and Site. i.e.
Want
1
1
2
1
2
2
I define a little wrapper:
rle <- function(x) cumsum(!duplicated(x))
and I notice inconsistent behavior when I supply:
have1 <- unlist(tapply(val$Date, val[, c( 'Site', 'Subject')], rle))
versus
have2 <- unlist(tapply(val$Date, val[, c('Subject', 'Site')], rle))
> have1
[1] 1 1 2 1 2 2
> have2
[1] 1 2 2 1 1 2
Is there any way to ensure that the natural ordering of the dataset is followed regardless of the specific columns supplied to the INDEX argument?

library(dplyr)
val %>%
group_by(Subject, Site) %>%
mutate(Want = match(Date, unique(Date))) %>%
ungroup
-output
# A tibble: 6 × 4
Subject Site Date Want
<int> <int> <chr> <int>
1 1 2 2020-01-01 1
2 1 2 2020-01-01 1
3 1 2 2020-01-02 2
4 2 1 2020-01-02 1
5 2 1 2020-01-03 2
6 2 1 2020-01-03 2
val$Want <- with(val, ave(as.integer(as.Date(Date)), Subject, Site,
FUN = \(x) match(x, unique(x))))
val$Want
[1] 1 1 2 1 2 2
data
val <- structure(list(Subject = c(1L, 1L, 1L, 2L, 2L, 2L), Site = c(2L,
2L, 2L, 1L, 1L, 1L), Date = c("2020-01-01", "2020-01-01", "2020-01-02",
"2020-01-02", "2020-01-03", "2020-01-03")),
class = "data.frame", row.names = c(NA,
-6L))

Trying to find occurrences of ID that meets sequential conditions in R

I'm trying to return a logical vector based on whether a person meets one set of conditions and ALSO meets another set of conditions later on. I'm using a data frame that looks like so:
Person.Id Year Term
250 1 3
250 1 1
250 2 3
300 1 3
511 2 1
300 1 5
700 2 3
What I want to return is a logical vector that indicates true/false if person ID 250 has year 1 and term 3, AND later has year 2 term 3. So a person that only has year 1 term 3 or year 1 term 5 will return false. Solutions in dplyr preferred! I feel like this is simple and I'm just missing something. I initially tried this code but all it returned was a blank df:
df2 <- df1 %>%
group_by(Person.Id) %>%
filter((year==1 & term==3) & (year==2 & term==3))

Are you looking for something like this ?
require(dplyr)
df %>%
group_by(Person.Id) %>%
mutate(count=sum((year==1 & term==3) | (year==2 & term==3))) %>%
mutate(count2=if_else(count==2,T,F))
# A tibble: 7 x 5
# Groups: Person.Id [4]
Person.Id year term count count2
<int> <int> <int> <int> <lgl>
1 250 1 3 2 TRUE
2 250 1 1 2 TRUE
3 250 2 3 2 TRUE
4 300 1 3 1 FALSE
5 511 2 1 0 FALSE
6 300 1 5 1 FALSE
7 700 2 3 1 FALSE

Maybe this can help:
#Data
Data <- structure(list(Person.Id = c(250L, 250L, 250L, 300L, 511L, 300L,
700L), Year = c(1L, 1L, 2L, 1L, 2L, 1L, 2L), Term = c(3L, 1L,
3L, 3L, 1L, 5L, 3L)), row.names = c(NA, -7L), class = "data.frame")
#Flags
cond1 <- Data$Year==1 & Data$Term==3
cond2 <- Data$Year==2 & Data$Term==3
#Replace
Data$Flag1 <- 0
Data$Flag1[cond1]<-1
Data$Flag2 <- 0
Data$Flag2[cond2]<-1
#Filter
Data %>% group_by(Person.Id) %>% filter(Flag1==1 | Flag2==1)
# A tibble: 4 x 5
# Groups: Person.Id [3]
Person.Id Year Term Flag1 Flag2
<int> <int> <int> <dbl> <dbl>
1 250 1 3 1 0
2 250 2 3 0 1
3 300 1 3 1 0
4 700 2 3 0 1

R find consecutive months

I'd like to find consecutive month by client. I thought this is easy but
still can't find solutions..
My goal is to find months' consecutive purchases for each client. Any
My data
Client Month consecutive
A 1 1
A 1 2
A 2 3
A 5 1
A 6 2
A 8 1
B 8 1

In base R, we can use ave
df$consecutive <- with(df, ave(Month, Client, cumsum(c(TRUE, diff(Month) > 1)),
FUN = seq_along))
df
# Client Month consecutive
#1 A 1 1
#2 A 1 2
#3 A 2 3
#4 A 5 1
#5 A 6 2
#6 A 8 1
#7 B 8 1
In dplyr, we can create a new group with lag to compare the current month with the previous month and assign row_number() in each group.
library(dplyr)
df %>%
group_by(Client,group=cumsum(Month-lag(Month, default = first(Month)) > 1)) %>%
mutate(consecutive = row_number()) %>%
ungroup %>%
select(-group)

We can create a grouping variable based on the difference in adjacent 'Month' for each 'Client' and use that to create the sequence
library(dplyr)
df1 %>%
group_by(Client) %>%
group_by(grp =cumsum(c(TRUE, diff(Month) > 1)), add = TRUE) %>%
mutate(consec = row_number()) %>%
ungroup %>%
select(-grp)
# A tibble: 7 x 4
# Client Month consecutive consec
# <chr> <int> <int> <int>
#1 A 1 1 1
#2 A 1 2 2
#3 A 2 3 3
#4 A 5 1 1
#5 A 6 2 2
#6 A 8 1 1
#7 B 8 1 1
Or using data.table
library(data.table)
setDT(df1)[, grp := cumsum(c(TRUE, diff(Month) > 1)), Client
][, consec := seq_len(.N), .(Client, grp)
][, grp := NULL][]
data
df1 <- structure(list(Client = c("A", "A", "A", "A", "A", "A", "B"),
Month = c(1L, 1L, 2L, 5L, 6L, 8L, 8L), consecutive = c(1L,
2L, 3L, 1L, 2L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-7L))

Create incremental value with restart with condition within ID

so I have a data of 2 fields, ID and Timestamp
ID Time
1 12
1 15
1 16
2 12
2 11
And i want to increment if the difference between time and previous time is inferior to 2 for example within the same ID, unless stay at the same value and restart at 1 when ID is different.
Desired output:
ID Time ID_SESSION
1 12 1
1 15 1
1 16 2
2 12 1
2 11 1
It would be needed in dplyr/sparklyr for spark implementation with R/

A one-liner using base R,
with(df, ave(Time, ID, FUN = function(i)cumsum(c(TRUE, diff(i) <= 2))))
#[1] 1 1 2 1 2

May be we need
library(dplyr)
df1 %>%
group_by(ID) %>%
mutate(ID_SESSION = (lag(c(FALSE, diff(Time) > 2), default= FALSE)) + 1)
Or in a one-liner with data.table
library(data.table)
setDT(df1)[, ID_SESSION := shift(c(FALSE, diff(Time) > 2), fill = FALSE) + 1, ID]
df1
# ID Time ID_SESSION
#1: 1 12 1
#2: 1 15 1
#3: 1 16 2
#4: 2 12 1
#5: 2 11 1
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L), Time = c(12L, 15L,
16L, 12L, 11L)), class = "data.frame", row.names = c(NA, -5L))

Add column based on other columns values

I am honest, I could come up with a decent title for this.
Basically, I have a dateframe:
ID Qty BasePrice Total
1 2 30 50
1 1 20 20
2 4 5 15
For each line I want to calculate the following:
Result = (Qty * BasePrice) - Total
Which is supposedly easy to do in R. However, I want to group the results by ID (sum them).
Sample Output:
ID Qty BasePrice Total Results
1 2 30 50 10
1 1 20 20 10
2 4 5 15 5
For instance, for ID=1, the values represent ((2*30)-50)+((1*20)-20)
Any idea on how can I achieve this?
Thanks!

We can do a group_by sum of the difference between the product of 'Qty', 'BasePrice' with 'Total'
library(dplyr)
df1 %>%
group_by(ID) %>%
mutate(Result = sum((Qty * BasePrice) - Total))
# A tibble: 3 x 5
# Groups: ID [2]
# ID Qty BasePrice Total Result
# <int> <int> <int> <int> <int>
#1 1 2 30 50 10
#2 1 1 20 20 10
#3 2 4 5 15 5
data
df1 <- structure(list(ID = c(1L, 1L, 2L), Qty = c(2L, 1L, 4L), BasePrice = c(30L,
20L, 5L), Total = c(50L, 20L, 15L)), class = "data.frame", row.names = c(NA,
-3L))

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

subsetting duplicates per individual - r

dfin <- STUDY ID CYCLE TIME VALUE 1 1 0 10 50 1 1 0 20 20 1 2 1 20 20 Per study and ID, for those who have duplicate CYCLE == 0 values, remove the row that had the higher TIME. dfout <- STUDY ID CYCLE TIME VALUE 1 1 0 10 50 1 2 1 20 20 Using RStudio.

Related

Identify unique values within a multivariable subset

Trying to find occurrences of ID that meets sequential conditions in R

R find consecutive months

Create incremental value with restart with condition within ID

Add column based on other columns values

Categories

Resources