How to deduplicate based upon an interval between dates in same column - r

I have a table that looks something like this:
ID Date Type
1 2019/03/12 A
1 2019/03/12 A
2 2019/01/07 A
2 2019/04/20 B
3 2019/02/09 C
4 2019/01/19 A
4 2019/01/23 A
I want to deduplicate this table by ID, but rows whose dates are more than 7 days apart should both be kept. If the dates are less than 7 days apart, I want to keep only the earliest date.
Want:
ID Date Type
1 2019/03/12 A
2 2019/01/07 A
2 2019/04/20 B
3 2019/02/09 C
4 2019/01/19 A
I'm just struggling with where to start conceptually.

One option is to convert 'Date' to the Date class (ymd from lubridate is used here), group by 'ID', and keep the first row of each group plus any row whose 'Date' is at least 7 days after the previous row. This assumes the rows are already sorted by 'Date' within each 'ID'.
library(dplyr)
library(lubridate)
df1 %>%
mutate(Date = ymd(Date)) %>%
group_by(ID) %>%
filter(c(TRUE, diff(Date) >= 7))
# A tibble: 5 x 3
# Groups: ID [4]
# ID Date Type
# <int> <date> <chr>
#1 1 2019-03-12 A
#2 2 2019-01-07 A
#3 2 2019-04-20 B
#4 3 2019-02-09 C
#5 4 2019-01-19 A
data
df1 <- structure(list(ID = c(1L, 1L, 2L, 2L, 3L, 4L, 4L), Date = c("2019/03/12",
"2019/03/12", "2019/01/07", "2019/04/20", "2019/02/09", "2019/01/19",
"2019/01/23"), Type = c("A", "A", "A", "B", "C", "A", "A")),
class = "data.frame", row.names = c(NA,
-7L))
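One caveat with diff(): it compares each row with the row immediately before it, not with the last row that was kept. If that distinction matters for your data, a minimal base R sketch could look like the following. The helper name keep_spaced and its gap argument are made up for illustration, and the "at least 7 days" threshold is an assumption carried over from the answer above.

```r
# Hypothetical helper: flag a row as kept only if it is at least `gap` days
# after the last row that was KEPT (not merely the previous row).
keep_spaced <- function(dates, gap = 7) {
  d <- as.numeric(dates)          # works for Date or plain numeric input
  keep <- logical(length(d))
  last_kept <- NA_real_
  for (i in seq_along(d)) {
    if (is.na(last_kept) || d[i] - last_kept >= gap) {
      keep[i] <- TRUE
      last_kept <- d[i]
    }
  }
  keep
}

df1 <- data.frame(
  ID = c(1L, 1L, 2L, 2L, 3L, 4L, 4L),
  Date = as.Date(c("2019/03/12", "2019/03/12", "2019/01/07", "2019/04/20",
                   "2019/02/09", "2019/01/19", "2019/01/23"),
                 format = "%Y/%m/%d"),
  Type = c("A", "A", "A", "B", "C", "A", "A"),
  stringsAsFactors = FALSE
)

# Sort first so "earliest" is well defined within each ID
df1 <- df1[order(df1$ID, df1$Date), ]

# ave() applies the helper per ID; it returns 0/1, so convert back to logical
keep <- as.logical(ave(as.numeric(df1$Date), df1$ID, FUN = keep_spaced))
result <- df1[keep, ]
result
```

For the sample data both approaches keep the same five rows. They differ for a sequence such as day 0, day 4, day 8: diff() keeps only day 0, while this helper also keeps day 8 because it is 8 days after the last kept row.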

Related

Return value closest to date between 2 tables

I have 2 tables, both have a common ID that needs to be used to retrieve another value closest to the first table's date column.
Table_1
ID date_1
1 2/3/2021
2 4/19/2019
3 1/6/2020
Table_2
ID date_2 value
1 2/1/2021 x
1 4/19/2021 y
1 1/6/2020 z
2 5/19/2019 g
2 4/11/2019 a
3 4/11/2019 bb
3 7/17/2019 cc
3 1/16/2020 dd
And the goal is to add another column to table_1 to return the value from table_2 for the same ID that is closest to the date. In other words, I need to return the value from table_2 that shares the same ID value and has the minimum difference between date_1 and date_2.
Example:
ID date_1 result
1 2/3/2021 x
2 4/19/2019 a
3 1/6/2020 dd
I found an INDEX/MATCH solution in Excel, but I would like to do this in R. I'm unsure whether a join is the best way or whether there's a more iterative way to solve this.
Please help?
Here's a way using dplyr and lubridate.
library(dplyr)
library(lubridate)
table_1 <- read.table(text='
ID date_1
1 2/3/2021
2 4/19/2019
3 1/6/2020', header=T)
table_1$date_1 <- mdy(table_1$date_1)
table_2 <- read.table(text='
ID date_2 value
1 2/1/2021 x
1 4/19/2021 y
1 1/6/2020 z
2 5/19/2019 g
2 4/11/2019 a
3 4/11/2019 bb
3 7/17/2019 cc
3 1/16/2020 dd', header=T)
table_2$date_2 <- mdy(table_2$date_2)
new_table_1 <-
table_2 %>%
left_join(table_1, by = 'ID') %>%
mutate(result = abs(date_2 - date_1)) %>%
group_by(ID) %>%
slice(which.min(result)) %>%
select(ID, date_1, value)
new_table_1
# A tibble: 3 x 3
# Groups: ID [3]
ID date_1 value
<int> <date> <chr>
1 1 2021-02-03 x
2 2 2019-04-19 a
3 3 2020-01-06 dd
You can also use the following solution. It is essential that you transform your date columns to date class before using this code:
library(dplyr)
library(lubridate)
Table_1 %>%
mutate(date_1 = mdy(date_1)) %>%
rowwise() %>%
mutate(Min = Table_2$value[Table_2$ID == ID][which.min(abs(date_1 - Table_2$date_2[Table_2$ID == ID]))])
# A tibble: 3 x 3
# Rowwise:
ID date_1 Min
<int> <date> <chr>
1 1 2021-02-03 x
2 2 2019-04-19 a
3 3 2020-01-06 dd
Data
Table_2 <-
structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L), date_2 = structure(c(18659,
18736, 18267, 18035, 17997, 17997, 18094, 18277), class = "Date"),
value = c("x", "y", "z", "g", "a", "bb", "cc", "dd")), class = "data.frame", row.names = c(NA,
-8L))
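For comparison, the same join-then-pick-the-minimum idea can be sketched in base R without any packages. This is only an illustration: the column name gap and the object names joined and closest are not from the answers above, and the dates are typed in ISO format to avoid parsing.

```r
table_1 <- data.frame(
  ID = 1:3,
  date_1 = as.Date(c("2021-02-03", "2019-04-19", "2020-01-06"))
)
table_2 <- data.frame(
  ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L),
  date_2 = as.Date(c("2021-02-01", "2021-04-19", "2020-01-06", "2019-05-19",
                     "2019-04-11", "2019-04-11", "2019-07-17", "2020-01-16")),
  value = c("x", "y", "z", "g", "a", "bb", "cc", "dd"),
  stringsAsFactors = FALSE
)

# Pair every table_1 row with all table_2 rows that share its ID
joined <- merge(table_1, table_2, by = "ID")

# Absolute distance in days between the two dates
joined$gap <- abs(as.numeric(joined$date_2 - joined$date_1))

# Sort so the smallest gap comes first within each ID, then keep that first row
joined <- joined[order(joined$ID, joined$gap), ]
closest <- joined[!duplicated(joined$ID), c("ID", "date_1", "value")]
closest
```

Note that merge() drops IDs that have no match in table_2 by default, and that ties in gap are broken by whichever row sorts first.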

Count distinct in R groupby, first spliting the cells by ","?

I have data in format given below
a b
1 A,B
1 A
1 B
2 A,B
2 D,C
2 A
2 A
What I need is the number of distinct values in column 'b' after grouping by column 'a':
a count
1 2
2 4
For a = 1 there are only 2 distinct values (A, B), but for a = 2 there are 4 (A, B, C, D).
I could first explode the data into long format and then do the group-by, but since I have a few other aggregations to do, I was looking for a way to do it in one line.
Thanks in advance
We can use aggregate in base R :
aggregate(b~a,df, function(x) length(unique(unlist(strsplit(x, ',')))))
# a b
#1 1 2
#2 2 4
data
df <- structure(list(a = c(1L, 1L, 1L, 2L, 2L, 2L, 2L), b = c("A,B",
"A", "B", "A,B", "D,C", "A", "A")), class = "data.frame", row.names = c(NA, -7L))
Using tidyr::separate_rows and dplyr::n_distinct this could be achieved like so:
library(dplyr)
d %>%
tidyr::separate_rows(b) %>%
group_by(a) %>%
summarise(count = n_distinct(b))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 2
#> a count
#> <int> <int>
#> 1 1 2
#> 2 2 4
DATA
d <- read.table(text = "a b
1 A,B
1 A
1 B
2 A,B
2 D,C
2 A
2 A", header = TRUE)
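Base R using Map():
counts <- Map(function(x) length(unique(trimws(unlist(strsplit(x, ","))))),
with(df, split(b, a)))
# one row per group; the group labels are the names of the list
data.frame(a = names(counts), count = unlist(counts), row.names = NULL)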
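Since the question mentions wanting other aggregations alongside the distinct count, one way to avoid reshaping is to wrap the split-and-count step in a small function and call it next to the other summaries. The sketch below is base R only; count_tokens and the extra n_rows column are hypothetical names chosen for illustration.

```r
# Hypothetical helper: number of distinct tokens after splitting on `sep`
count_tokens <- function(x, sep = ",") {
  length(unique(trimws(unlist(strsplit(x, sep, fixed = TRUE)))))
}

df <- data.frame(
  a = c(1L, 1L, 1L, 2L, 2L, 2L, 2L),
  b = c("A,B", "A", "B", "A,B", "D,C", "A", "A"),
  stringsAsFactors = FALSE
)

# One summary row per group, with the distinct count next to another aggregate
res <- do.call(rbind, lapply(split(df, df$a), function(g) {
  data.frame(a = g$a[1], count = count_tokens(g$b), n_rows = nrow(g))
}))
res
```

The same helper could be called inside aggregate() or dplyr::summarise() in place of the inline anonymous function.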

R: How to find differing values in one column based on multiple other columns

Newbie R question:
Say I have a data frame with 3 columns: id, date, and value.
How do I flag, for each id, whether it has differing values, but only when those values fall on different dates?
In the example below, id 1 would be a miss (different values but the same date), id 2 would be a hit (different values on different dates), and id 3 would be a miss since the values don't differ.
id date value
1 1/1/2000 A
1 1/1/2000 B
2 1/1/2000 A
2 1/1/1999 B
3 1/1/2000 A
3 1/1/1999 A
After grouping by 'id', check whether there is more than one unique 'date' and more than one unique 'value', and pass both conditions to filter
library(dplyr)
df1 %>%
group_by(id) %>%
filter(n_distinct(date) > 1, n_distinct(value) > 1)
-output
# A tibble: 2 x 3
# Groups: id [1]
# id date value
# <int> <chr> <chr>
#1 2 1/1/2000 A
#2 2 1/1/1999 B
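Or with anyDuplicated. Note that this requires every 'date' and every 'value' within an id to be unique, so it matches the n_distinct version only when each id has exactly two rows, as in this example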
df1 %>%
group_by(id) %>%
filter(!anyDuplicated(date), !anyDuplicated(value))
# A tibble: 2 x 3
# Groups: id [1]
# id date value
# <int> <chr> <chr>
#1 2 1/1/2000 A
#2 2 1/1/1999 B
data
df1 <- structure(list(id = c(1L, 1L, 2L, 2L, 3L, 3L), date = c("1/1/2000",
"1/1/2000", "1/1/2000", "1/1/1999", "1/1/2000", "1/1/1999"),
value = c("A", "B", "A", "B", "A", "A")),
class = "data.frame", row.names = c(NA,
-6L))
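The "more than one distinct date and more than one distinct value" check can also be sketched in base R with tapply. The object names n_dates, n_values and hits are chosen here for illustration only.

```r
df1 <- data.frame(
  id = c(1L, 1L, 2L, 2L, 3L, 3L),
  date = c("1/1/2000", "1/1/2000", "1/1/2000", "1/1/1999", "1/1/2000", "1/1/1999"),
  value = c("A", "B", "A", "B", "A", "A"),
  stringsAsFactors = FALSE
)

# Number of distinct dates and distinct values per id
n_dates  <- tapply(df1$date,  df1$id, function(x) length(unique(x)))
n_values <- tapply(df1$value, df1$id, function(x) length(unique(x)))

# An id is a "hit" when both counts exceed 1
hits <- names(n_dates)[n_dates > 1 & n_values > 1]

result <- df1[as.character(df1$id) %in% hits, ]
result
```

Here hits holds the ids as character strings (the names produced by tapply), which is why the ids are converted with as.character() before matching.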

Transpose Rows in batches to Columns in R

My data.frame df looks like this:
A 1
A 2
A 5
B 2
B 3
B 4
C 3
C 7
C 9
I want it to look like this:
A B C
1 2 3
2 3 7
5 4 9
I have tried spread() but probably not in the right way. Any ideas?
We can use unstack from base R
unstack(df1, col2 ~ col1)
# A B C
#1 1 2 3
#2 2 3 7
#3 5 4 9
Or with split
data.frame(split(df1$col2, df1$col1))
Or if we use spread or pivot_wider, make sure to create a sequence column
library(dplyr)
library(tidyr)
df1 %>%
group_by(col1) %>%
mutate(rn = row_number()) %>%
ungroup %>%
pivot_wider(names_from = col1, values_from = col2) %>%
# or use
# spread(col1, col2) %>%
select(-rn)
# A tibble: 3 x 3
# A B C
# <int> <int> <int>
#1 1 2 3
#2 2 3 7
#3 5 4 9
Or using dcast
library(data.table)
dcast(setDT(df1), rowid(col1) ~ col1)[, .(A, B, C)]
data
df1 <- structure(list(col1 = c("A", "A", "A", "B", "B", "B", "C", "C",
"C"), col2 = c(1L, 2L, 5L, 2L, 3L, 4L, 3L, 7L, 9L)),
class = "data.frame", row.names = c(NA,
-9L))
In data.table, we can use dcast :
library(data.table)
dcast(setDT(df), rowid(col1)~col1, value.var = 'col2')[, col1 := NULL][]
# A B C
#1: 1 2 3
#2: 2 3 7
#3: 5 4 9
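The base R approaches above rely on every group having the same number of rows. If that cannot be guaranteed, one possible sketch pads the shorter groups with NA before building the data frame. The function name to_wide is hypothetical, and the padding behaviour is an assumption about what is wanted, not something stated in the question.

```r
# Hypothetical helper: one column per group, shorter groups padded with NA
to_wide <- function(values, groups) {
  pieces <- split(values, groups)
  n <- max(lengths(pieces))
  # Indexing past the end of a vector yields NA, which does the padding
  as.data.frame(lapply(pieces, function(x) x[seq_len(n)]))
}

df1 <- data.frame(
  col1 = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
  col2 = c(1L, 2L, 5L, 2L, 3L, 4L, 3L, 7L, 9L),
  stringsAsFactors = FALSE
)

wide <- to_wide(df1$col2, df1$col1)
wide
```

With equal group sizes this gives the same A, B, C layout as unstack(); with unequal sizes the trailing cells of the shorter columns are NA.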

R find consecutive months

I'd like to find consecutive months by client. I thought this would be easy, but
I still can't find a solution.
My goal is to count each client's consecutive months of purchases.
My data
Client Month consecutive
A 1 1
A 1 2
A 2 3
A 5 1
A 6 2
A 8 1
B 8 1
In base R, we can use ave
df$consecutive <- with(df, ave(Month, Client, cumsum(c(TRUE, diff(Month) > 1)),
FUN = seq_along))
df
# Client Month consecutive
#1 A 1 1
#2 A 1 2
#3 A 2 3
#4 A 5 1
#5 A 6 2
#6 A 8 1
#7 B 8 1
In dplyr, we can create a new group with lag to compare the current month with the previous month and assign row_number() in each group.
library(dplyr)
df %>%
group_by(Client,group=cumsum(Month-lag(Month, default = first(Month)) > 1)) %>%
mutate(consecutive = row_number()) %>%
ungroup %>%
select(-group)
We can create a grouping variable based on the difference in adjacent 'Month' for each 'Client' and use that to create the sequence
library(dplyr)
df1 %>%
group_by(Client) %>%
group_by(grp = cumsum(c(TRUE, diff(Month) > 1)), .add = TRUE) %>% # `.add` replaces the deprecated `add` argument (dplyr >= 1.0.0)
mutate(consec = row_number()) %>%
ungroup %>%
select(-grp)
# A tibble: 7 x 4
# Client Month consecutive consec
# <chr> <int> <int> <int>
#1 A 1 1 1
#2 A 1 2 2
#3 A 2 3 3
#4 A 5 1 1
#5 A 6 2 2
#6 A 8 1 1
#7 B 8 1 1
Or using data.table
library(data.table)
setDT(df1)[, grp := cumsum(c(TRUE, diff(Month) > 1)), Client
][, consec := seq_len(.N), .(Client, grp)
][, grp := NULL][]
data
df1 <- structure(list(Client = c("A", "A", "A", "A", "A", "A", "B"),
Month = c(1L, 1L, 2L, 5L, 6L, 8L, 8L), consecutive = c(1L,
2L, 3L, 1L, 2L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-7L))
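All of the answers above share the same core trick, cumsum(c(TRUE, diff(Month) > 1)). As a minimal illustration, here it is unpacked step by step for client A's months only; the intermediate names gap, starts and run_id are chosen here for readability.

```r
Month <- c(1L, 1L, 2L, 5L, 6L, 8L)   # client A from the example data

# TRUE where a month is more than 1 after the previous one
gap <- diff(Month) > 1

# The first row always starts a run, so prepend TRUE
starts <- c(TRUE, gap)

# Each TRUE bumps the counter, giving every run its own id
run_id <- cumsum(starts)

# Number the rows within each run
consecutive <- ave(Month, run_id, FUN = seq_along)

data.frame(Month, starts, run_id, consecutive)
```

A repeated month (difference 0) and the next month (difference 1) both continue the current run, which is why the two rows with Month 1 are numbered 1 and 2.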
