Random sampling one row within each id - r

I have data like this:
data<-data.frame(id=c(1,1,1,1,2,2,2,3,3,3,4,4,4),
yearmonthweek=c(2012052,2012053,2012061,2012062,2013031,2013052,2013053,2012052,
2012053,2012054,2012071,2012073,2012074),
event=c(0,1,1,0,0,1,0,0,0,0,0,0,0),
a=c(11,12,13,10,11,12,15,14,13,15,19,10,20))
id stands for personal id. yearmonthweek means year, month and week. I want to clean data by the following rules. First, find id that have at least one event. In this case id=1 and 2 have events and id=3 and 4 have no events. Secondly, pick a random row from an id that has events and pick a random row from an id that has no events. So, the number of rows should be same as the number of id. My expected output looks like this:
data<-data.frame(id=c(1,2,3,4),
yearmonthweek=c(2012053,2013052,2012052,2012073),
event=c(1,1,0,0),
a=c(12,12,14,10))
Since I use random sampling, the values can be different as above, but there should be 4 rows like this.

Here is an option
set.seed(2022)
data %>%
group_by(id) %>%
mutate(has_event = any(event == 1)) %>%
filter(if_else(has_event, event == 1, event == 0)) %>%
slice_sample(n = 1) %>%
select(-has_event) %>%
ungroup()
## A tibble: 4 × 4
# id yearmonthweek event a
# <dbl> <dbl> <dbl> <dbl>
#1 1 2012061 1 13
#2 2 2013052 1 12
#3 3 2012053 0 13
#4 4 2012074 0 20
Explanation: Group by id, flag if a group has at least one event; if it does, only keep those rows where event == 1; then uniform-randomly select a single row using slice_sample per group.

Here is a dplyr way in two steps.
data <- data.frame(id=c(1,1,1,1,2,2,2,3,3,3,4,4,4),
yearmonthweek=c(2012052,2012053,2012061,2012062,2013031,2013052,2013053,2012052,
2012053,2012054,2012071,2012073,2012074),
event=c(0,1,1,0,0,1,0,0,0,0,0,0,0),
a=c(11,12,13,10,11,12,15,14,13,15,19,10,20))
suppressPackageStartupMessages(
library(dplyr)
)
bind_rows(
data %>%
filter(event != 0) %>%
group_by(id) %>%
sample_n(size = 1),
data %>%
group_by(id) %>%
mutate(event = !all(event == 0)) %>%
filter(!event) %>%
sample_n(size = 1)
)
#> # A tibble: 4 × 4
#> # Groups: id [4]
#> id yearmonthweek event a
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2012061 1 13
#> 2 2 2013052 1 12
#> 3 3 2012054 0 15
#> 4 4 2012071 0 19
Created on 2022-10-21 with reprex v2.0.2

With data.table:
library(data.table)
set.seed(1)
setorder(
unique(
setorder(
setDT(data)[
, idx := .I # add an index column to re-sort later
][
sample(nrow(data)) # randomize the table
],
-event # sort descending by event
),
by = "id" # get unique rows by id
),
idx # re-sort
)[, idx := NULL][] # remove the index column
#> id yearmonthweek event a
#> 1: 1 2012053 1 12
#> 2: 2 2013052 1 12
#> 3: 3 2012053 0 13
#> 4: 4 2012071 0 19

Related

How to summarize in R the number of first occurrences of a character string in a dataframe column?

I am trying to figure out a fast way to calculate the number of "first times" a specified character appears in a dataframe column, by groups. In this example, I am trying to summarize (sum) the number of first times, for each Period, the State of "X" appears, grouped by ID. I am looking for a fast way to process this because it is to be run against a database of several million rows. Maybe there is a good solution using the data.table package?
Immediately below I illustrate what I am trying to achieve, and at the bottom I post the code for the dataframe called testDF.
Code:
testDF <-
data.frame(
ID = c(rep(10,5),rep(50,5),rep(60,5)),
Period = c(1:5,1:5,1:5),
State = c("A","B","X","X","X",
"A","A","A","A","A",
"A","X","A","X","B")
)
Maybe we can group by 'ID' first and then create the column and then do a group by 'period' and summarise
library(dplyr)
testDF %>%
group_by(ID) %>%
mutate(`1stStateX` = row_number() == which(State == "X")[1]) %>%
group_by(Period) %>%
summarise(`1stStateX` = sum(`1stStateX`, na.rm = TRUE), .groups = 'drop')
-output
# A tibble: 5 × 2
Period `1stStateX`
<int> <int>
1 1 0
2 2 1
3 3 1
4 4 0
5 5 0
Another option will be to slice after grouping by 'ID', get the count and use complete to fill the 'Period' not available
library(tidyr)
testDF %>%
group_by(ID) %>%
slice(match('X', State)) %>%
ungroup %>%
count(Period, sort = TRUE ,name = "1stStateX") %>%
complete(Period = unique(testDF$Period),
fill = list(`1stStateX` = 0))
-output
# A tibble: 5 × 2
Period `1stStateX`
<int> <int>
1 1 0
2 2 1
3 3 1
4 4 0
5 5 0
Or similar option in data.table
library(data.table)
setDT(testDF)[, `1stStateX` := .I == .I[State == 'X'][1],
ID][, .(`1stStateX` = sum(`1stStateX`, na.rm = TRUE)), by = Period]
-output
Period 1stStateX
<int> <int>
1: 1 0
2: 2 1
3: 3 1
4: 4 0
5: 5 0

Use a specific value in summarise (dplyr) without filtering it out

I am trying to compare a new algorithm result versus an old one. I need to know approximately how many days of a difference the new algorithm has in predicting a "D" versus the old one.
I can't seem to figure out how to point to the first row (day) that contains a 'D' (min(day) and new == 'D') without filtering (I was able to grab the row using a double filter due to the grouping, but not use it). I want to use it in summarise using dplyr which is why I have included pseudo code similar to where i am currently at in my own dataset.
In my data there are groups of varying length (number of days) for each ID, which is why I made groups of different lengths in the example.
library(dplyr)
id = c(123,123,123,123,123,456,456,456,456)
old = c('S','S','S','S','D','S','S','D','D')
new = c('S','S','D','D','D','S','D','D','D')
day = c(1,2,3,4,5,1,2,3,4)
data = data.frame(id,old,new,day)
data
#> id old new day
#> 1 123 S S 1
#> 2 123 S S 2
#> 3 123 S D 3
#> 4 123 S D 4
#> 5 123 D D 5
#> 6 456 S S 1
#> 7 456 S D 2
#> 8 456 D D 3
#> 9 456 D D 4
d = data %>%
group_by(id)%>%
arrange(day,.by_group=T)%>%
add_tally(new=='S',name='S')%>%
add_tally(new=='D',name='D')%>%
group_by(id,S,D)
# summarise(diff = (day of 1st old D) - (day of 1st new D) )
#Expected Outcome
ido = c(123,456)
S = c(2,1)
D = c(3,3)
diff = c(2,1)
outcome = data.frame(ido,S,D,diff)
outcome
#> ido S D diff
#> 1 123 2 3 2
#> 2 456 1 3 1
Created on 2019-12-26 by the reprex package (v0.3.0)
We can group_by id and count the occurrence of 'S' and 'D' and the difference between first occurrence of old and new 'D'.
library(dplyr)
data %>%
group_by(id) %>%
summarise(S = sum(new == 'S'),
D = sum(new == 'D'),
diff = which.max(old == 'D') - which.max(new == 'D'))
#OR if there could be id without D use
#diff = which(old == 'D')[1] - which(new == 'D')[1])
# A tibble: 2 x 4
# id S D diff
# <dbl> <int> <int> <int>
#1 123 2 3 2
#2 456 1 3 1
We can use pivot_wider after summariseing to get the frequency count after creating a column to take the difference between the 'day' based on the first occurence of 'D' in both 'old' and 'new' columnss
library(dplyr)
library(tidyr)
data %>%
group_by(id) %>%
group_by(diff = day[match("D", old)] - day[match("D", new)],
new, add = TRUE) %>%
summarise(n = n()) %>%
ungroup %>%
pivot_wider(names_from = new, values_from = n)
# A tibble: 2 x 4
# id diff D S
# <dbl> <dbl> <int> <int>
#1 123 2 3 2
#2 456 1 3 1

dplyr: calculate the days for product replenishment

I am working on a dataset in which I need to calculate how long does it take for a retail store to replenish some products from shortage, and here is a quick view of the dataset in the simplest form:
Date <- c("2019-1-1","2019-1-2","2019-1-3","2019-1-4","2019-1-5","2019-1-6","2019-1-7","2019-1-8")
Product <- rep("Product A",8)
Net_Available_Qty <- c(-2,-2,10,8,-5,-6,-7,0)
sample_df <- data.frame(Date,Product,Net_Available_Qty)
When the Net_Available_Qty becomes negative, it means there is a shortage. When it turns back to 0 or positive qty, it means the supply has been recovered. What I need to calculate is the days between when we first see shortage and when it is recovered. In this case, for the 1st shortage, it took 2 days to recover and for the second shortage, it took 3 days to recover.
A tidyverse solution would be most welcome.
I hope someone else finds a cleaner solution. But this produces diffDate which assigns the date difference from when a negative turns positive/zero.
sample_df %>%
mutate(sign = ifelse(Net_Available_Qty > 0, "pos", ifelse(Net_Available_Qty < 0, "neg", "zero")),
sign_lag = lag(sign, default = sign[1]), # get previous value (exception in the first place)
change = ifelse(sign != sign_lag, 1 , 0), # check if there's a change
sequence=sequence(rle(as.character(sign))$lengths)) %>%
group_by(sequence) %>%
mutate(diffDate = as.numeric(difftime(Date, lag(Date,1))),
diffDate=ifelse(Net_Available_Qty <0, NA, ifelse((sign=='pos'| sign=='zero') & sequence==1, diffDate, NA))) %>%
ungroup() %>%
select(Date, Product, Net_Available_Qty, diffDate)
#Schilker had a great idea using rle. I am building on his answer and offering a slightly shorter version including the use of cumsum
Date <- c("2019-1-1","2019-1-2","2019-1-3","2019-1-4","2019-1-5","2019-1-6","2019-1-7","2019-1-8")
Product <- rep("Product A",8)
Net_Available_Qty <- c(-2,-2,10,8,-5,-6,-7,0)
sample_df <- data.frame(Date,Product,Net_Available_Qty)
library(tidyverse)
sample_df %>%
mutate(
diffDate = c(1, diff(as.Date(Date))),
sequence = sequence(rle(Net_Available_Qty >= 0)$lengths),
group = cumsum(c(TRUE, diff(sequence)) != 1L)
) %>%
group_by(group) %>%
mutate(n_days = max(cumsum(diffDate)))
#> # A tibble: 8 x 7
#> # Groups: group [4]
#> Date Product Net_Available_Qty diffDate sequence group n_days
#> <fct> <fct> <dbl> <dbl> <int> <int> <dbl>
#> 1 2019-1-1 Product A -2 1 1 0 2
#> 2 2019-1-2 Product A -2 1 2 0 2
#> 3 2019-1-3 Product A 10 1 1 1 2
#> 4 2019-1-4 Product A 8 1 2 1 2
#> 5 2019-1-5 Product A -5 1 1 2 3
#> 6 2019-1-6 Product A -6 1 2 2 3
#> 7 2019-1-7 Product A -7 1 3 2 3
#> 8 2019-1-8 Product A 0 1 1 3 1
Created on 2020-02-23 by the reprex package (v0.3.0)

How to find duplicates based on values in 2 columns but also the groupings by another column in R?

I have a dataset with 3 columns: ID, value a, and value b. I want to group the dataset based on the values in the ID column and then identify duplicates that have identical data in the value a and b columns between the different groupings.
I know that I can use the dplyr package and data %>% group_by (ID) to group my dataset based on the ID column. I also know that I can use data[duplicated(data[,2:3]),] to return all rows with duplicate data in rows 2 (value a) and 3 (value b).
However, I would like a function that can only finds duplicates between different ID groups instead of just duplicates within the whole dataset. I've tried combining group_by and duplicated, but it doesn't return the correct results. Which function would do this?
It was a little unclear if you wanted to return:
only the distinct rows
single examples of duplicated rows
all duplicated rows
So here are some options:
library(dplyr)
library(readr)
"ID,a,b
1, 1, 1
1, 1, 1
1, 1, 2
2, 1, 1
2, 1, 2" %>%
read_csv() -> exp_dat
# return only distinct rows
exp_dat %>%
distinct(ID, a, b)
# # A tibble: 4 x 3
# ID a b
# <dbl> <dbl> <dbl>
# 1 1 1 1
# 2 1 1 2
# 3 2 1 1
# 4 2 1 2
# return single examples of duplicated rows
exp_dat %>%
group_by(ID, a, b) %>%
count() %>%
filter(n > 1) %>%
ungroup() %>%
select(-n)
# # A tibble: 1 x 3
# ID a b
# <dbl> <dbl> <dbl>
# 1 1 1 1
# return all duplicated rows
exp_dat %>%
group_by(ID, a, b) %>%
add_count() %>%
filter(n > 1) %>%
ungroup() %>%
select(-n)
# # A tibble: 2 x 3
# ID a b
# <dbl> <dbl> <dbl>
# 1 1 1 1
# 2 1 1 1

Filtering on grouped variable maintaining sequence

I have a dataframe :
df <- data.frame(
Group=c('A','A','A','A','B','B','B','B'),
Activity = c('EOSP','NOR','EOSP','COSP','NOR','EOSP','WL','NOR'),
TimeLine=c(1,2,3,4,1,2,3,4)
)
I want to filter for only two activities for each group and in the order in which I am filtering. For example, I am only looking for the activities EOSP and NOR but in the order too. This code:
df %>% group_by(Group) %>%
filter(all(c('EOSP','NOR') %in% Activity) & Activity %in% c('EOSP','NOR'))
results in:
# A tibble: 6 x 3
# Groups: Group [2]
Group Activity TimeLine
<fct> <fct> <dbl>
1 A EOSP 1
2 A NOR 2
3 A EOSP 3
4 B NOR 1
5 B EOSP 2
6 B NOR 4
I don't want row 3 as EOSP occurs after NOR. Similarly for group B, I don't want row 4, as NOR is occurring before EOSP. How do I achieve this?
You can use match to get the first instance of Activity == EOSP and use slice to remove everything before that. Once you do that, then you can remove duplicates and filter on EOSP and NOR, i.e.
library(tidyverse)
df %>%
group_by(Group) %>%
mutate(new = match('EOSP', Activity)) %>%
slice(new:n()) %>%
distinct(Activity, .keep_all = TRUE) %>%
filter(Activity %in% c('EOSP', 'NOR'))
which gives,
# A tibble: 4 x 4
# Groups: Group [2]
Group Activity TimeLine new
<fct> <fct> <dbl> <int>
1 A EOSP 1 1
2 A NOR 2 1
3 B EOSP 2 2
4 B NOR 4 2
NOTE 1: You can ungroup() and select(-new)
NOTE 2: The warning messages being issued here
(Warning messages:
1: In new:4L : numerical expression has 4 elements: only the first used
2: In new:4L : numerical expression has 4 elements: only the first used
)
do not affect us since we only need it to use the first element since all are the same anyway
here is an option with data.table package: you join df with itself, subsetted it to keep only EOSP Activity and computing the min of TimeLine by group, then you can keep only the rows with TimeLine greater or equal to this TimeLine, in order to be sure you keep NOR only if there is EOSP before. Then you drop duplicated Group and Activity if you want to only keep 2 activities per group:
df[df[Activity=="EOSP", min(TimeLine), by=Group], on="Group"][Activity %in% c("NOR", "EOSP") & TimeLine >= V1][!duplicated(paste(Group, Activity))]
# Group Activity TimeLine V1
#1: A EOSP 1 1
#2: A NOR 2 1
#3: B EOSP 2 2
#4: B NOR 4 2
Here is a dplyr idea:
df %>%
filter(Activity %in% c('EOSP','NOR')) %>%
group_by(Group) %>%
mutate(tmp = which(Activity == 'EOSP' & !duplicated(Activity))) %>%
filter(row_number() %in% c(tmp, tmp+1))
# A tibble: 4 x 4
# Groups: Group [2]
Group Activity TimeLine tmp
<fct> <fct> <dbl> <int>
1 A EOSP 1 1
2 A NOR 2 1
3 B EOSP 2 2
4 B NOR 4 2

Resources