Create incremental value with conditional restart within ID in R

I have data with two fields, ID and Time:
ID Time
1 12
1 15
1 16
2 12
2 11
Within the same ID, I want to increment the value when the difference between the current time and the previous time is less than 2 (for example); otherwise it should stay at the same value, and it should restart at 1 when the ID changes.
Desired output:
ID Time ID_SESSION
1 12 1
1 15 1
1 16 2
2 12 1
2 11 1
I need this in dplyr/sparklyr for a Spark implementation with R.

A one-liner using base R (note that its purely cumulative logic returns 2 for the last row of ID 2, unlike the desired output; the lag()-based answers below reproduce the desired output exactly):
with(df1, ave(Time, ID, FUN = function(i) cumsum(c(TRUE, diff(i) <= 2))))
#[1] 1 1 2 1 2

Maybe we need
library(dplyr)
df1 %>%
  group_by(ID) %>%
  mutate(ID_SESSION = lag(c(FALSE, diff(Time) > 2), default = FALSE) + 1)
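To see what the lag() shuffle does, trace it for ID 1; the increment lands one row after the gap, which matches the desired output, including 1, 1 for ID 2:
library(dplyr)
t <- c(12, 15, 16)              # Time for ID 1
flags <- c(FALSE, diff(t) > 2)  # FALSE TRUE FALSE: was there a gap before this row?
lag(flags, default = FALSE)     # FALSE FALSE TRUE: shifted down one row
lag(flags, default = FALSE) + 1 # 1 1 2 -> ID_SESSION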
Or in a one-liner with data.table
library(data.table)
setDT(df1)[, ID_SESSION := shift(c(FALSE, diff(Time) > 2), fill = FALSE) + 1, ID]
df1
# ID Time ID_SESSION
#1: 1 12 1
#2: 1 15 1
#3: 1 16 2
#4: 2 12 1
#5: 2 11 1
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L), Time = c(12L, 15L,
16L, 12L, 11L)), class = "data.frame", row.names = c(NA, -5L))
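The asker wanted this on Spark via sparklyr; below is a minimal, untested sketch of the cumulative variant. The connection call and table name are assumptions. c()/diff() do not translate to Spark SQL, so the gap test is rewritten with lag(), and cumsum() becomes a windowed sum whose ordering comes from arrange().
library(sparklyr)
library(dplyr)

sc  <- spark_connect(master = "local")           # assumed local Spark connection
sdf <- copy_to(sc, df1, "df1", overwrite = TRUE)

sdf %>%
  group_by(ID) %>%
  arrange(Time) %>%                              # supplies the window ordering
  mutate(ID_SESSION = cumsum(as.integer(
    is.na(lag(Time)) | Time - lag(Time) <= 2     # same test as the base R cumsum
  ))) %>%
  ungroup() %>%
  collect()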

Related

Identify unique values within a multivariable subset

I have data that look like this:
Subject Site Date
1 2 '2020-01-01'
1 2 '2020-01-01'
1 2 '2020-01-02'
2 1 '2020-01-02'
2 1 '2020-01-03'
2 1 '2020-01-03'
And I'd like to create an order variable for unique dates by Subject and Site. i.e.
Want
1
1
2
1
2
2
I define a little wrapper:
rle <- function(x) cumsum(!duplicated(x))
and I notice inconsistent behavior when I supply:
have1 <- unlist(tapply(val$Date, val[, c( 'Site', 'Subject')], rle))
versus
have2 <- unlist(tapply(val$Date, val[, c('Subject', 'Site')], rle))
> have1
[1] 1 1 2 1 2 2
> have2
[1] 1 2 2 1 1 2
Is there any way to ensure that the natural ordering of the dataset is followed regardless of the specific columns supplied to the INDEX argument?
library(dplyr)
val %>%
  group_by(Subject, Site) %>%
  mutate(Want = match(Date, unique(Date))) %>%
  ungroup
-output
# A tibble: 6 × 4
Subject Site Date Want
<int> <int> <chr> <int>
1 1 2 2020-01-01 1
2 1 2 2020-01-01 1
3 1 2 2020-01-02 2
4 2 1 2020-01-02 1
5 2 1 2020-01-03 2
6 2 1 2020-01-03 2
val$Want <- with(val, ave(as.integer(as.Date(Date)), Subject, Site,
                          FUN = \(x) match(x, unique(x))))
val$Want
[1] 1 1 2 1 2 2
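On the tapply question itself: tapply() orders its result by the INDEX factors rather than by the rows of val, which is why have1 and have2 disagree after unlist(). A base R pattern that preserves the original row order is split()/unsplit(), since unsplit() inverts split():
f <- interaction(val$Subject, val$Site, drop = TRUE)
unsplit(lapply(split(val$Date, f), function(x) cumsum(!duplicated(x))), f)
#[1] 1 1 2 1 2 2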
data
val <- structure(list(Subject = c(1L, 1L, 1L, 2L, 2L, 2L), Site = c(2L,
2L, 2L, 1L, 1L, 1L), Date = c("2020-01-01", "2020-01-01", "2020-01-02",
"2020-01-02", "2020-01-03", "2020-01-03")),
class = "data.frame", row.names = c(NA,
-6L))

R function for collapsing multiple ranges of different columns from wide to long format?

I have a dataset with multiple different ranges of columns in each row (each row corresponds to one individual), as below. Each of the column types has three levels (0, 1, and 2).
id col1_0 col1_1 col1_2 col2_0 col2_1 col2_2 col3_0 col3_1 col3_2
1 0 1 3 2 2 3 3 4 5
2 1 1 2 2 4 7 4 5 5
.
.
etc.
What I need is to collapse all the col1 columns into one column, all the col2 columns into another, and all the col3 columns into a third, for each id, as below.
id x col1 col2 col3
1 0 0 2 3
1 1 1 2 4
1 2 3 3 5
2 0 1 2 4
2 1 1 4 5
2 2 2 7 5
.
.
etc.
In addition, I would also need to create an x column with values 0, 1, and 2 for each id. However, I only managed to collapse the first range of columns (col1) with the code below.
library(tidyverse)
longer_data <- dataframe %>%
  group_by(id) %>%
  pivot_longer(col1_0:col1_2, names_to = "x1", values_to = "col1")
x1 here creates a column with the original column names, so I would also need an additional x column that keeps only the last number of each original column name.
Is there a way to achieve this? Many thanks in advance!
We don't need any group_by; this can be done directly with pivot_longer by specifying names_sep and the .value sentinel in names_to. Note the order of .value and x: the values go into new columns named for the prefix before the _, while the suffix goes into the new column 'x'.
library(dplyr)
library(tidyr)
df1 %>%
  pivot_longer(cols = -id, names_to = c('.value', 'x'), names_sep = "_")
-output
# A tibble: 6 x 5
# id x col1 col2 col3
# <int> <chr> <int> <int> <int>
#1 1 0 0 2 3
#2 1 1 1 2 4
#3 1 2 3 3 5
#4 2 0 1 2 4
#5 2 1 1 4 5
#6 2 2 2 7 5
data
df1 <- structure(list(id = 1:2, col1_0 = 0:1, col1_1 = c(1L, 1L), col1_2 = 3:2,
col2_0 = c(2L, 2L), col2_1 = c(2L, 4L), col2_2 = c(3L, 7L
), col3_0 = 3:4, col3_1 = 4:5, col3_2 = c(5L, 5L)),
class = "data.frame", row.names = c(NA,
-2L))
Here is a base R option using reshape, where timevar = "x" creates a column named x, and sep = "_" picks up the last number of each original column name.
res <- reshape(
df,
direction = "long",
idvar = "id",
varying = -1,
timevar = "x",
sep = "_"
)
res <- res[order(res$id), ]
Output
> res
id x col1 col2 col3
1.0 1 0 0 2 3
1.1 1 1 1 2 4
1.2 1 2 3 3 5
2.0 2 0 1 2 4
2.1 2 1 1 4 5
2.2 2 2 2 7 5
Data
> dput(df)
structure(list(id = 1:2, col1_0 = 0:1, col1_1 = c(1L, 1L), col1_2 = 3:2,
col2_0 = c(2L, 2L), col2_1 = c(2L, 4L), col2_2 = c(3L, 7L
), col3_0 = 3:4, col3_1 = 4:5, col3_2 = c(5L, 5L)), class = "data.frame", row.names = c(NA,
-2L))
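A data.table route is also possible with melt(); a sketch, assuming data.table >= 1.14.2 for the measure() helper. Names are split on _, the prefix becomes an output column via the special value.name keyword, and the suffix lands in the new x column:
library(data.table)
melt(as.data.table(df1),
  measure.vars = measure(value.name, x, sep = "_"))[order(id)]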

R find consecutive months

I'd like to find consecutive months by client. I thought this would be easy, but I still can't find a solution.
My goal is to count consecutive months of purchases for each client. Any help would be appreciated.
My data
Client Month consecutive
A 1 1
A 1 2
A 2 3
A 5 1
A 6 2
A 8 1
B 8 1
In base R, we can use ave
df1$consecutive <- with(df1, ave(Month, Client, cumsum(c(TRUE, diff(Month) > 1)),
                                 FUN = seq_along))
df1
# Client Month consecutive
#1 A 1 1
#2 A 1 2
#3 A 2 3
#4 A 5 1
#5 A 6 2
#6 A 8 1
#7 B 8 1
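To see why the run-id trick works, trace the grouping expression for Client A's months:
m <- c(1, 1, 2, 5, 6, 8)       # Client A's months
diff(m)                        # 0 1 3 1 2
c(TRUE, diff(m) > 1)           # TRUE FALSE FALSE TRUE FALSE TRUE
cumsum(c(TRUE, diff(m) > 1))   # 1 1 1 2 2 3 -> one id per consecutive run
# seq_along() within each run id then yields 1 2 3, 1 2, 1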
In dplyr, we can create a new grouping column with lag to compare the current month with the previous one, and then assign row_number() within each group.
library(dplyr)
df1 %>%
  group_by(Client, group = cumsum(Month - lag(Month, default = first(Month)) > 1)) %>%
  mutate(consecutive = row_number()) %>%
  ungroup %>%
  select(-group)
We can create a grouping variable based on the difference in adjacent 'Month' for each 'Client' and use that to create the sequence
library(dplyr)
df1 %>%
  group_by(Client) %>%
  group_by(grp = cumsum(c(TRUE, diff(Month) > 1)), .add = TRUE) %>%
  mutate(consec = row_number()) %>%
  ungroup %>%
  select(-grp)
# A tibble: 7 x 4
# Client Month consecutive consec
# <chr> <int> <int> <int>
#1 A 1 1 1
#2 A 1 2 2
#3 A 2 3 3
#4 A 5 1 1
#5 A 6 2 2
#6 A 8 1 1
#7 B 8 1 1
Or using data.table
library(data.table)
setDT(df1)[, grp := cumsum(c(TRUE, diff(Month) > 1)), Client
][, consec := seq_len(.N), .(Client, grp)
][, grp := NULL][]
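The run id can also be fed straight to rowid() for a single-step data.table variant (a sketch; rowid() numbers rows within each unique combination of its arguments):
setDT(df1)[, consec := rowid(Client, cumsum(c(TRUE, diff(Month) > 1)))][]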
data
df1 <- structure(list(Client = c("A", "A", "A", "A", "A", "A", "B"),
Month = c(1L, 1L, 2L, 5L, 6L, 8L, 8L), consecutive = c(1L,
2L, 3L, 1L, 2L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-7L))

Subsetting duplicates per individual

dfin <-
STUDY ID CYCLE TIME VALUE
1 1 0 10 50
1 1 0 20 20
1 2 1 20 20
Per study and ID, for those who have duplicate CYCLE == 0 values, remove the row that had the higher TIME.
dfout <-
STUDY ID CYCLE TIME VALUE
1 1 0 10 50
1 2 1 20 20
Using RStudio.
An option is to group by 'STUDY' and 'ID' and filter out the duplicated 0 values in 'CYCLE':
library(dplyr)
dfin %>%
  arrange(STUDY, ID, TIME) %>%
  group_by(STUDY, ID) %>%
  filter(!(duplicated(CYCLE) & CYCLE == 0))
# A tibble: 2 x 5
# Groups: STUDY, ID [2]
# STUDY ID CYCLE TIME VALUE
# <int> <int> <int> <int> <int>
#1 1 1 0 10 50
#2 1 2 1 20 20
Also, if there are many duplicates of 0 and we want to remove only the row where 'TIME' is also the maximum:
dfin %>%
  group_by(STUDY, ID) %>%
  filter(!(TIME == max(TIME) & CYCLE == 0))
Or using base R, ordering first so the lower TIME comes before its duplicate:
dfin1 <- dfin[do.call(order, dfin[c("STUDY", "ID", "TIME")]), ]
dfin1[!(duplicated(dfin1[1:3]) & dfin1$CYCLE == 0), ]
# STUDY ID CYCLE TIME VALUE
#1 1 1 0 10 50
#3 1 2 1 20 20
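If the rule is simply "keep the earliest TIME per STUDY/ID/CYCLE", dplyr's slice_min() is a compact alternative; note (an assumption about the intent) that it also dedupes groups where CYCLE != 0, which the filter() versions above leave untouched:
library(dplyr)
dfin %>%
  group_by(STUDY, ID, CYCLE) %>%
  slice_min(TIME, n = 1, with_ties = FALSE) %>%  # one row with the lowest TIME per group
  ungroup()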
data
dfin <- structure(list(STUDY = c(1L, 1L, 1L), ID = c(1L, 1L, 2L), CYCLE = c(0L,
0L, 1L), TIME = c(10L, 20L, 20L), VALUE = c(50L, 20L, 20L)),
class = "data.frame", row.names = c(NA,
-3L))

R: frequency with group by ID [duplicate]

This question already has answers here:
Frequency count of two column in R
(8 answers)
Closed 6 years ago.
I have a data frame like this:
ID Cont
1 a
1 a
1 b
2 a
2 c
2 d
I need to report the frequency of "Cont" by ID. The output should be:
ID Cont Freq
1 a 2
1 b 1
2 a 1
2 c 1
2 d 1
Using dplyr, you can group_by both ID and Cont and summarise using n() to get Freq:
library(dplyr)
res <- df %>% group_by(ID, Cont) %>% summarise(Freq = n())
##Source: local data frame [5 x 3]
##Groups: ID [?]
##
## ID Cont Freq
## <int> <fctr> <int>
##1 1 a 2
##2 1 b 1
##3 2 a 1
##4 2 c 1
##5 2 d 1
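The same pattern is available as a one-liner via dplyr's count(), which wraps group_by() + summarise(n()):
df %>% count(ID, Cont, name = "Freq")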
Data:
df <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L), Cont = structure(c(1L,
1L, 2L, 1L, 3L, 4L), .Label = c("a", "b", "c", "d"), class = "factor")), .Names = c("ID",
"Cont"), class = "data.frame", row.names = c(NA, -6L))
## ID Cont
##1 1 a
##2 1 a
##3 1 b
##4 2 a
##5 2 c
##6 2 d
library(data.table)
setDT(df)[, .(Freq = .N), by = .(ID, Cont)]
# ID Cont Freq
# 1: 1 a 2
# 2: 1 b 1
# 3: 2 a 1
# 4: 2 c 1
# 5: 2 d 1
With base R:
df1 <- subset(as.data.frame(table(df)), Freq != 0)
If you want to order by ID, add this line:
df1[order(df1$ID), ]
ID Cont Freq
1 1 a 2
3 1 b 1
2 2 a 1
6 2 c 1
8 2 d 1
