Identifying duplicates within groups by latest date in R

I currently have a data frame that looks like this:
ID Value Date
1 1 A 1/1/2018
2 1 B 2/3/1988
3 1 B 6/3/1994
4 2 A 12/6/1999
5 2 B 24/12/1957
6 3 A 9/8/1968
7 3 B 20/9/2016
8 3 C 15/4/1993
9 3 C 9/8/1994
10 4 A 8/8/1988
11 4 C 6/4/2001
Within each ID I would like to identify rows where there is a duplicate Value; the row I want to flag is the duplicate with the most recent Date.
The resulting data frame should look like this:
ID Value Date mostRecentDuplicate
1 1 A 1/1/2018 0
2 1 B 2/3/1988 0
3 1 B 6/3/1994 1
4 2 A 12/6/1999 0
5 2 B 24/12/1957 0
6 3 A 9/8/1968 0
7 3 B 20/9/2016 0
8 3 C 15/4/1993 0
9 3 C 9/8/1994 1
10 4 A 8/8/1988 0
11 4 C 6/4/2001
How do I go about doing this?
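For reference, the data frame above can be reconstructed as follows (a sketch; Date is kept as character since the answer below converts it, and on R >= 4.0 Value will be character rather than the factor shown in the answer's output, which works either way):
df <- data.frame(
  ID    = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4),
  Value = c("A", "B", "B", "A", "B", "A", "B", "C", "C", "A", "C"),
  Date  = c("1/1/2018", "2/3/1988", "6/3/1994", "12/6/1999", "24/12/1957",
            "9/8/1968", "20/9/2016", "15/4/1993", "9/8/1994", "8/8/1988", "6/4/2001")
)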

Using dplyr, we can first convert Date to an actual date value, then group by ID and Value and assign 1 where a group has more than one row and the row number equals that of the maximum Date.
library(dplyr)
df %>%
  mutate(Date = as.Date(Date, "%d/%m/%Y")) %>%
  group_by(ID, Value) %>%
  mutate(mostRecentDuplicate = +(n() > 1 & row_number() == which.max(Date))) %>%
  ungroup()
# A tibble: 11 x 4
# ID Value Date mostRecentDuplicate
# <int> <fct> <date> <int>
# 1 1 A 2018-01-01 0
# 2 1 B 1988-03-02 0
# 3 1 B 1994-03-06 1
# 4 2 A 1999-06-12 0
# 5 2 B 1957-12-24 0
# 6 3 A 1968-08-09 0
# 7 3 B 2016-09-20 0
# 8 3 C 1993-04-15 0
# 9 3 C 1994-08-09 1
#10 4 A 1988-08-08 0
#11 4 C 2001-04-06 0
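If you prefer data.table, the same logic can be written as below (a sketch, not part of the original answer; it assumes df as constructed above):
library(data.table)
setDT(df)[, Date := as.Date(Date, "%d/%m/%Y")]
df[, mostRecentDuplicate := +(.N > 1 & seq_len(.N) == which.max(Date)),
   by = .(ID, Value)]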

Related

Create column with ID starting at 1 and increment when value in another column changes in R

I have a data frame like so:
ID <- c('A','A','A','A','A','A','A','A','A','A','A','B','B','B','B','B')
val1 <- c(0,1,2,3,4,5,6,7,8,9,10,11,0,1,2,3)
val2 <- c(0,1,2,3,4,5,0,1,0,1,2,0,1,0,1,2)
df <- data.frame(ID, val1, val2)
Output:
ID val1 val2
1 A 0 0
2 A 1 1
3 A 2 2
4 A 3 3
5 A 4 4
6 A 5 5
7 A 6 0
8 A 7 1
9 A 8 0
10 A 9 1
11 A 10 2
12 B 11 0
13 B 0 1
14 B 1 0
15 B 2 1
16 B 3 2
I am trying to create a third column (val3) which acts like an index. When val1 = 0 and val2 = 0 it should be 1 (this is also grouped by ID). It should stay at 1 and then increment by 1 each time val2 = 0 again, as in the desired output below:
ID val1 val2 val3
1 A 0 0 1
2 A 1 1 1
3 A 2 2 1
4 A 3 3 1
5 A 4 4 1
6 A 5 5 1
7 A 6 0 2
8 A 7 1 2
9 A 8 0 3
10 A 9 1 3
11 A 10 2 3
12 B 11 0 1
13 B 0 1 1
14 B 1 0 2
15 B 2 1 2
16 B 3 2 2
How can this be achieved? I tried:
df <- df %>%
  group_by(ID, val2) %>%
  mutate(val3 = row_number())
And:
df$val3 <- cumsum(c(1,diff(df$val2)==0))
But neither provides the desired outcome.
Inside cumsum, use the logical comparison val2 == 0:
df %>%
  group_by(ID) %>%
  mutate(val3 = cumsum(val2 == 0))
# A tibble: 16 × 4
# Groups: ID [2]
ID val1 val2 val3
<chr> <dbl> <dbl> <int>
1 A 0 0 1
2 A 1 1 1
3 A 2 2 1
4 A 3 3 1
5 A 4 4 1
6 A 5 5 1
7 A 6 0 2
8 A 7 1 2
9 A 8 0 3
10 A 9 1 3
11 A 10 2 3
12 B 11 0 1
13 B 0 1 1
14 B 1 0 2
15 B 2 1 2
16 B 3 2 2
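The same result can be had without dplyr via base R's ave, which applies cumsum within each ID (a sketch of the same idea):
df$val3 <- ave(as.integer(df$val2 == 0), df$ID, FUN = cumsum)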

R: extract nested cumulatives from a data frame

Given a data frame:
period <- c(1,1,1,3,3,3,3)
item <- c("a","b","b","a","b","c","c")
quantity <- c(1,3,2,4,5,3,7)
df<-data.frame(period,item,quantity)
df
period item quantity
1 1 a 1
2 1 b 3
3 1 b 2
4 3 a 4
5 3 b 5
6 3 c 3
7 3 c 7
I want to obtain:
period item cumulative
1 a 1
1 b 5
1 c 0
2 a 0
2 b 0
2 c 0
3 a 4
3 b 5
3 c 10
I'm not sure what an efficient way to do this in R would be. The file has approximately 500k records and 10,000 different items.
You can use complete to create the missing combinations of period and item, and sum the quantity values for each combination.
library(dplyr)
library(tidyr)
df %>%
  complete(period = min(period):max(period), item) %>%
  group_by(period, item) %>%
  summarise(quantity = sum(quantity, na.rm = TRUE)) %>%
  ungroup()
# period item quantity
# <dbl> <chr> <dbl>
#1 1 a 1
#2 1 b 5
#3 1 c 0
#4 2 a 0
#5 2 b 0
#6 2 c 0
#7 3 a 4
#8 3 b 5
#9 3 c 10
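With ~500k rows and 10,000 items, a data.table version may be faster (a sketch under the same logic: build the full period/item grid with CJ, then aggregate each grid row via a join; unmatched rows sum to 0 thanks to na.rm = TRUE):
library(data.table)
setDT(df)
grid <- CJ(period = min(df$period):max(df$period), item = unique(df$item))
df[grid, on = .(period, item),
   .(cumulative = sum(quantity, na.rm = TRUE)), by = .EACHI]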

Row sequence by group using two columns

Suppose I have the following data frame:
data <- data.frame(ID     = c(1,1,1,1,1,1,1,2,2,2,2,3,3,3),
                   Value  = c(1,1,0,1,0,1,1,1,0,0,1,0,0,0),
                   Result = c(1,1,2,3,4,5,5,1,2,2,3,1,1,1))
How can I obtain column Result from the first two columns?
I have tried different approaches using rle, seq, cumsum and cur_group_id, but I can't get the Result column easily.
library(data.table)
library(dplyr)
data %>%
  group_by(ID) %>%
  mutate(Result2 = rleid(Value))
This gives us:
ID Value Result Result2
<dbl> <dbl> <dbl> <int>
1 1 1 1 1
2 1 1 1 1
3 1 0 2 2
4 1 1 3 3
5 1 0 4 4
6 1 1 5 5
7 1 1 5 5
8 2 1 1 1
9 2 0 2 2
10 2 0 2 2
11 2 1 3 3
12 3 0 1 1
13 3 0 1 1
14 3 0 1 1
Does this work?
library(dplyr)
data %>%
  group_by(ID) %>%
  mutate(r = rep(seq_along(rle(ID * Value)$values), rle(ID * Value)$lengths))
# A tibble: 14 x 4
# Groups: ID [3]
ID Value Result r
<dbl> <dbl> <dbl> <int>
1 1 1 1 1
2 1 1 1 1
3 1 0 2 2
4 1 1 3 3
5 1 0 4 4
6 1 1 5 5
7 1 1 5 5
8 2 1 1 1
9 2 0 2 2
10 2 0 2 2
11 2 1 3 3
12 3 0 1 1
13 3 0 1 1
14 3 0 1 1
We could use rle with ave in base R:
data$Result2 <- with(data, ave(Value, ID, FUN = function(x)
  inverse.rle(within.list(rle(x), values <- seq_along(values)))))
data$Result2
#[1] 1 1 2 3 4 5 5 1 2 2 3 1 1 1
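If you are on dplyr >= 1.1.0, consecutive_id gives the same run IDs as data.table's rleid without the extra dependency (a sketch):
library(dplyr)
data %>%
  mutate(Result2 = consecutive_id(Value), .by = ID)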

Assign sequential group ID given a group start indicator

I need to assign subgroup IDs given a group ID and an indicator showing the beginning of the new subgroup. Here's a test dataset:
group <- c(rep("A", 8), rep("B", 8))
x1 <- c(rep(0, 3), rep(1, 3), rep(0, 2))
x2 <- rep(0:1, 4)
df <- data.frame(group=group, indic=c(x1, x2))
Here is the resulting data frame:
df
group indic
1 A 0
2 A 0
3 A 0
4 A 1
5 A 1
6 A 1
7 A 0
8 A 0
9 B 0
10 B 1
11 B 0
12 B 1
13 B 0
14 B 1
15 B 0
16 B 1
indic==1 means that row is the beginning of a new subgroup, and the subgroup should be numbered 1 higher than the previous subgroup. Where indic==0 the subgroup should be the same as the previous subgroup. The subgroup numbering starts at 1. When the group variable changes, the subgroup numbering resets to 1. I would like to use the tidyverse framework.
Here is the result that I want:
df
group indic subgroup
1 A 0 1
2 A 0 1
3 A 0 1
4 A 1 2
5 A 1 3
6 A 1 4
7 A 0 4
8 A 0 4
9 B 0 1
10 B 1 2
11 B 0 2
12 B 1 3
13 B 0 3
14 B 1 4
15 B 0 4
16 B 1 5
I would like to be able to show methods that I've already tried but that didn't work, but I haven't been able to find anything even close. Any help will be appreciated.
You can just use:
library(dplyr)
df %>%
  group_by(group) %>%
  mutate(subgroup = cumsum(indic) + 1)
# group indic subgroup
# <fct> <dbl> <dbl>
# 1 A 0 1
# 2 A 0 1
# 3 A 0 1
# 4 A 1 2
# 5 A 1 3
# 6 A 1 4
# 7 A 0 4
# 8 A 0 4
# 9 B 0 1
# 10 B 1 2
# 11 B 0 2
# 12 B 1 3
# 13 B 0 3
# 14 B 1 4
# 15 B 0 4
# 16 B 1 5
We use dplyr to do the grouping, and then cumsum takes the cumulative sum of the indic column, so the subgroup number increases each time a 1 is seen.
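For completeness, the same cumsum idea without dplyr, using base R's ave (a sketch):
df$subgroup <- ave(df$indic, df$group, FUN = function(x) cumsum(x) + 1)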

Count number of values which are less than current value

I'd like to count, for each row, how many values in the input column are smaller than that row's value (please see the desired results below). The tricky part is that the condition is based on the current row's value, unlike the general case where the condition is a fixed number.
data <- data.frame(input = c(1,1,1,1,2,2,3,5,5,5,5,6))
input
1 1
2 1
3 1
4 1
5 2
6 2
7 3
8 5
9 5
10 5
11 5
12 6
The results I expect are shown below. For example, for observations 5 and 6 (with value 2), there are 4 observations with value 1 that are less than their value 2, hence their count is 4.
input count
1 1 0
2 1 0
3 1 0
4 1 0
5 2 4
6 2 4
7 3 6
8 5 7
9 5 7
10 5 7
11 5 7
12 6 11
Edit: as I am dealing with grouped data in dplyr, the ultimate result I wish to get is shown below; that is, I would like the condition to be evaluated dynamically within each group.
data <- data.frame(id    = c(1,1,2,2,2,3,3,4,4,4,4,4),
                   input = c(1,1,1,1,2,2,3,5,5,5,5,6),
                   count = c(0,0,0,0,2,0,1,0,0,0,0,4))
id input count
1 1 1 0
2 1 1 0
3 2 1 0
4 2 1 0
5 2 2 2
6 3 2 0
7 3 3 1
8 4 5 0
9 4 5 0
10 4 5 0
11 4 5 0
12 4 6 4
Here is an option with the tidyverse:
library(tidyverse)
data %>%
  mutate(count = map_int(input, ~ sum(.x > input)))
# input count
#1 1 0
#2 1 0
#3 1 0
#4 1 0
#5 2 4
#6 2 4
#7 3 6
#8 5 7
#9 5 7
#10 5 7
#11 5 7
#12 6 11
Update
With the updated data, add a group_by on 'id' to the code above:
data %>%
  group_by(id) %>%
  mutate(count1 = map_int(input, ~ sum(.x > input)))
# A tibble: 12 x 4
# Groups: id [4]
# id input count count1
# <dbl> <dbl> <dbl> <int>
# 1 1 1 0 0
# 2 1 1 0 0
# 3 2 1 0 0
# 4 2 1 0 0
# 5 2 2 2 2
# 6 3 2 0 0
# 7 3 3 1 1
# 8 4 5 0 0
# 9 4 5 0 0
#10 4 5 0 0
#11 4 5 0 0
#12 4 6 4 4
In base R, we can use sapply and, for each input value, count how many values are smaller than it.
data$count <- sapply(data$input, function(x) sum(x > data$input))
data
# input count
#1 1 0
#2 1 0
#3 1 0
#4 1 0
#5 2 4
#6 2 4
#7 3 6
#8 5 7
#9 5 7
#10 5 7
#11 5 7
#12 6 11
With dplyr, one way would be to use the rowwise function and follow the same logic:
library(dplyr)
data %>%
  rowwise() %>%
  mutate(count = sum(input > data$input))
1. outer and rowSums
data$count <- with(data, rowSums(outer(input, input, `>`)))
2. table and cumsum
# cumulative count of rows with a value <= each distinct input value
tt <- cumsum(table(data$input))
# shift by one position so each distinct value maps to the count of strictly smaller rows
v <- setNames(c(0, head(tt, -1)), c(head(names(tt), -1), tail(names(tt), 1)))
data$count <- v[match(data$input, names(v))]
3. data.table non-equi join
Perhaps more efficient is a non-equi join in data.table: count the number of rows (.N) for each match (by = .EACHI).
library(data.table)
setDT(data)
data[data, on = .(input < input), .N, by = .EACHI]
If your data is grouped by 'id', as in your update, join on that variable as well:
data[data, on = .(id, input < input), .N, by = .EACHI]
# id input N
# 1: 1 1 0
# 2: 1 1 0
# 3: 2 1 0
# 4: 2 1 0
# 5: 2 2 2
# 6: 3 2 0
# 7: 3 3 1
# 8: 4 5 0
# 9: 4 5 0
# 10: 4 5 0
# 11: 4 5 0
# 12: 4 6 4
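If performance matters, sorting once and using findInterval avoids the quadratic pairwise comparison; with left.open = TRUE it returns, for each value, how many sorted values lie strictly below it (a sketch, assuming data as the plain data frame defined in the question):
data$count <- findInterval(data$input, sort(data$input), left.open = TRUE)
# grouped variant for the updated data, applying the same idea within each id
data$count <- ave(data$input, data$id, FUN = function(x)
  findInterval(x, sort(x), left.open = TRUE))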
