R: Get unique values based on criteria from 2 other columns


Hi, I would like to get only one row for each Code. The rule for choosing that row: take the nearest Refresh Date that is >= Effective Date. If there is no Refresh Date that is >= Effective Date, take the nearest Refresh Date < Effective Date.
Below is my sample dataframe.
Code <- c("A","A","A", "A", "B", "B", "B", "B", "C","C","C","C")
Effective_Date <- as.Date(c("2020-08-25","2020-08-25","2020-08-25","2020-08-25","2021-12-18","2021-12-18",
"2021-12-18","2021-12-18","2021-10-15","2021-10-15","2021-10-15","2021-10-15"))
Refresh_Date <- as.Date(c("2020-09-25","2021-09-17","2022-11-25","2020-02-20","2021-12-12","2021-12-18",
"2022-01-15","2021-08-19","2021-08-20","2020-08-25","2021-09-30","2020-08-25"))
DF <- data.frame(Code,Effective_Date,Refresh_Date)
> DF
Code Effective_Date Refresh_Date
1 A 2020-08-25 2020-09-25
2 A 2020-08-25 2021-09-17
3 A 2020-08-25 2022-11-25
4 A 2020-08-25 2020-02-20
5 B 2021-12-18 2021-12-12
6 B 2021-12-18 2021-12-18
7 B 2021-12-18 2022-01-15
8 B 2021-12-18 2021-08-19
9 C 2021-10-15 2021-08-20
10 C 2021-10-15 2020-08-25
11 C 2021-10-15 2021-09-30
12 C 2021-10-15 2020-08-25
It's like aggregating by Code and Effective Date, but keeping the row that has the nearest Refresh Date >= Effective Date. If there is no Refresh Date that is >= Effective Date, keep the row with the nearest Refresh Date < Effective Date.
Below is my desired output:
> DF_DesiredOutput
Code Effective_Date Refresh_Date
1 A 2020-08-25 2020-09-25
2 B 2021-12-18 2021-12-18
3 C 2021-10-15 2021-09-30

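After grouping by 'Code', we can use slice with the index of the minimum absolute difference between 'Refresh_Date' and 'Effective_Date'. This picks the closest 'Refresh_Date' on either side of 'Effective_Date', which matches the desired output for the sample data.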
library(dplyr)
DF %>%
group_by(Code) %>%
slice(which.min(abs(Refresh_Date - Effective_Date))) %>%
ungroup()
-output
# A tibble: 3 × 3
Code Effective_Date Refresh_Date
<chr> <date> <date>
1 A 2020-08-25 2020-09-25
2 B 2021-12-18 2021-12-18
3 C 2021-10-15 2021-09-30

Here is an alternative approach using arrange by the absolute difference and then slice:
library(dplyr)
DF %>%
group_by(Code) %>%
arrange(abs(Refresh_Date-Effective_Date), .by_group = TRUE) %>%
slice(1)
Code Effective_Date Refresh_Date
<chr> <date> <date>
1 A 2020-08-25 2020-09-25
2 B 2021-12-18 2021-12-18
3 C 2021-10-15 2021-09-30
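
Both approaches above rank rows by absolute distance, so a Refresh_Date one day before Effective_Date would win over one five days after it. A minimal sketch of the rule exactly as the question states it (prefer on-or-after dates, fall back to earlier ones) might look like the following. The helper name pick_refresh and the extra "D" rows are hypothetical and only there for illustration.

```r
library(dplyr)

DF <- data.frame(
  Code = rep(c("A", "B", "C"), each = 4),
  Effective_Date = as.Date(rep(c("2020-08-25", "2021-12-18", "2021-10-15"), each = 4)),
  Refresh_Date = as.Date(c("2020-09-25", "2021-09-17", "2022-11-25", "2020-02-20",
                           "2021-12-12", "2021-12-18", "2022-01-15", "2021-08-19",
                           "2021-08-20", "2020-08-25", "2021-09-30", "2020-08-25"))
)

# Hypothetical helper: one row per Code / Effective_Date pair.
pick_refresh <- function(df) {
  df %>%
    mutate(gap = as.numeric(Refresh_Date - Effective_Date)) %>%
    group_by(Code, Effective_Date) %>%
    # FALSE sorts before TRUE, so on-or-after rows come first;
    # within each side the smallest absolute gap comes first.
    arrange(gap < 0, abs(gap), .by_group = TRUE) %>%
    slice(1) %>%
    ungroup() %>%
    select(-gap)
}

result <- pick_refresh(DF)

# Hypothetical rows where the two rules disagree:
# abs() would pick 2021-01-09, the stated rule picks 2021-01-15.
extra <- data.frame(
  Code = "D",
  Effective_Date = as.Date(c("2021-01-10", "2021-01-10")),
  Refresh_Date = as.Date(c("2021-01-09", "2021-01-15"))
)
extra_result <- pick_refresh(extra)
```

For the sample data this returns the same three rows as the answers above; the difference only shows up when an earlier Refresh_Date is closer than the nearest later one.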

Related

Calculate the days differences with mixed date format in R

I need to count differences in days between two mixed-structured dates. Here is an example dataset:
testdata <- data.frame(id = c(1,2,3),
date1 = c("2022/11/13 9:19:03 AM PST", "2022-11-01","2022-10-28"),
date2 = c("2022/12/12 1:52:29 PM PST","2022-10-21","2022/12/01 8:15:25 AM PST"))
> testdata
id date1 date2
1 1 2022/11/13 9:19:03 AM PST 2022/12/12 1:52:29 PM PST
2 2 2022-11-01 2022-10-21
3 3 2022-10-28 2022/12/01 8:15:25 AM PST
First I need to extract the dates, dropping the time of day, and then calculate the difference in days. So the expected dataset would be:
> df
id date1 date2 days.diff
1 1 2022/11/13 2022/12/12 29
2 2 2022-11-01 2022-10-21 11
3 3 2022-10-28 2022/12/01 34

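You could use anytime() from the anytime package to parse both columns and take the difference rowwise. Note that this computes date1 minus date2 on date-times rather than dates, so the results can be negative and fractional: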
library(dplyr)
library(anytime)
testdata %>%
rowwise() %>%
mutate(days.diff = anytime(date1) - anytime(date2))
#> # A tibble: 3 × 4
#> # Rowwise:
#> id date1 date2 days.diff
#> <dbl> <chr> <chr> <drtn>
#> 1 1 2022/11/13 9:19:03 AM PST 2022/12/12 1:52:29 PM PST -29.00000 days
#> 2 2 2022-11-01 2022-10-21 11.04167 days
#> 3 3 2022-10-28 2022/12/01 8:15:25 AM PST -34.04167 days
Created on 2023-01-20 with reprex v2.0.2

Using as.Date with tryFormats
library(dplyr)
testdata %>%
rowwise() %>%
mutate(across(starts_with("date"), ~ as.Date(.x,
tryFormats=c("%Y/%m/%d %H:%M:%S", "%Y-%m-%d"))),
days.diff = date2 - date1) %>%
ungroup()
# A tibble: 3 × 4
id date1 date2 days.diff
<dbl> <date> <date> <drtn>
1 1 2022-11-13 2022-12-12 29 days
2 2 2022-11-01 2022-10-21 -11 days
3 3 2022-10-28 2022-12-01 34 days
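
For comparison, a base R sketch of the same idea is shown below. It assumes every string starts with a 10-character year/month/day date (with either "/" or "-" as separator), which holds for the sample data but is an assumption in general. The helper name to_date is hypothetical.

```r
testdata <- data.frame(
  id = c(1, 2, 3),
  date1 = c("2022/11/13 9:19:03 AM PST", "2022-11-01", "2022-10-28"),
  date2 = c("2022/12/12 1:52:29 PM PST", "2022-10-21", "2022/12/01 8:15:25 AM PST")
)

# Hypothetical helper: keep the first 10 characters (the date part),
# normalise "/" to "-", then parse as a Date.
to_date <- function(x) {
  as.Date(gsub("/", "-", substr(x, 1, 10), fixed = TRUE))
}

# Signed difference in whole days (date2 minus date1).
testdata$days.diff <- as.numeric(to_date(testdata$date2) - to_date(testdata$date1))
```

Wrapping the result in abs() would give unsigned day counts if the direction of the difference does not matter.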

Obtaining values in one variable (height/weight) based on when it was collected (dates)

I'm working with a dataset where I have the date that a given value (weight) was collected, and then the weight (for that date). Some participants have multiple weights in the dataset because they have come back more than once; others only have one weight value. Is there an easy way to ask R to provide a new dataframe with one value per person, based on the earliest date? (And by default, those with only one value are included)?
I'm wondering if it would be advantageous to group by a subject ID and get their mean weight value (as I don't anticipate it may fluctuate drastically). But to be consistent, grouping based on the earliest/first weight recorded would be ideal.
I'm thinking a function in the 'lubridate' package might be useful, but I'm not 100% sure.

Sort by date, group by id, then take the first row per group:
library(dplyr)
weights %>%
arrange(date) %>%
group_by(id) %>%
slice(1) %>%
ungroup()
#> # A tibble: 3 × 3
#> id date weight
#> <int> <date> <dbl>
#> 1 1 2021-03-15 182.
#> 2 2 2021-05-12 133.
#> 3 3 2021-08-09 151.
Example data:
set.seed(13)
weights <- tibble::tibble(
id = rep(1:3, each = 3),
date = lubridate::ymd("2021-01-01") + sample(0:364, 9),
weight = rnorm(9, 160, 20)
)
weights
#> # A tibble: 9 × 3
#> id date weight
#> <int> <date> <dbl>
#> 1 1 2021-09-16 165.
#> 2 1 2021-12-23 153.
#> 3 1 2021-03-15 182.
#> 4 2 2021-07-24 138.
#> 5 2 2021-09-19 169.
#> 6 2 2021-05-12 133.
#> 7 3 2021-11-16 123.
#> 8 3 2021-08-09 151.
#> 9 3 2021-09-05 156.
Created on 2022-11-11 with reprex v2.0.2
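
The same "earliest row per id" selection can also be sketched with slice_min() from dplyr (available since dplyr 1.0.0), which avoids the separate arrange() step. The small data frame below is illustrative only, and the mean-weight summary is included because the question mentions it as an option.

```r
library(dplyr)

# Illustrative data: id 3 has a single visit and is still kept.
weights <- data.frame(
  id = c(1L, 1L, 1L, 2L, 2L, 3L),
  date = as.Date(c("2021-09-16", "2021-12-23", "2021-03-15",
                   "2021-07-24", "2021-05-12", "2021-08-09")),
  weight = c(165, 153, 182, 138, 133, 151)
)

# Earliest record per id; with_ties = FALSE keeps exactly one row
# even if two records share the same earliest date.
first_weights <- weights %>%
  group_by(id) %>%
  slice_min(date, n = 1, with_ties = FALSE) %>%
  ungroup()

# Alternative mentioned in the question: mean weight per id.
mean_weights <- weights %>%
  group_by(id) %>%
  summarise(mean_weight = mean(weight), .groups = "drop")
```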

Create columns based on date

case <- c("A","A","A","B","B","C","C","C","C")
date <- c("2022-01-01","2022-01-08","2022-06-07","2022-05-08","2022-03-06","2022-09-08","2022-09-23","2022-12-08","2022-06-05")
df <- data.frame(case,date)
I have a dataframe that looks like this:
# A tibble: 9 x 2
case date
<chr> <chr>
1 A 2022-01-01
2 A 2022-01-08
3 A 2022-06-07
4 B 2022-05-08
5 B 2022-03-06
6 C 2022-09-08
7 C 2022-09-23
8 C 2022-12-08
9 C 2022-06-05
I would like to essentially pivot_wider the rows based on date, where the earliest date would become instance_1, the next instance_2, and so on. I have tried the pivot_wider function but can't get the syntax right. Any help is appreciated.

We need a sequence column within each 'case' (here from rowid() in data.table) and then do pivot_wider
library(tidyr)
library(dplyr)
library(data.table)
library(stringr)
df %>%
arrange(case, date) %>%
mutate(cn = str_c('instance_', rowid(case))) %>%
pivot_wider(names_from = cn, values_from = date)
-output
# A tibble: 3 × 5
case instance_1 instance_2 instance_3 instance_4
<chr> <chr> <chr> <chr> <chr>
1 A 2022-01-01 2022-01-08 2022-06-07 <NA>
2 B 2022-03-06 2022-05-08 <NA> <NA>
3 C 2022-06-05 2022-09-08 2022-09-23 2022-12-08
Or a similar option with dcast
library(data.table)
dcast(setDT(df)[order(case, date)],
case ~ paste0('instance_', rowid(case)), value.var = 'date')
-output
Key: <case>
case instance_1 instance_2 instance_3 instance_4
<char> <char> <char> <char> <char>
1: A 2022-01-01 2022-01-08 2022-06-07 <NA>
2: B 2022-03-06 2022-05-08 <NA> <NA>
3: C 2022-06-05 2022-09-08 2022-09-23 2022-12-08
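
If loading data.table just for rowid() is not wanted, the sequence column can also be built with row_number() inside group_by(). This is a sketch of that variant using only dplyr and tidyr; the object name wide is hypothetical.

```r
library(dplyr)
library(tidyr)

case <- c("A", "A", "A", "B", "B", "C", "C", "C", "C")
date <- c("2022-01-01", "2022-01-08", "2022-06-07", "2022-05-08", "2022-03-06",
          "2022-09-08", "2022-09-23", "2022-12-08", "2022-06-05")
df <- data.frame(case, date)

wide <- df %>%
  arrange(case, date) %>%              # ISO date strings sort chronologically
  group_by(case) %>%
  mutate(cn = paste0("instance_", row_number())) %>%  # 1, 2, ... within each case
  ungroup() %>%
  pivot_wider(names_from = cn, values_from = date)
```

Cases with fewer dates than the longest case get NA in the trailing instance columns, as in the outputs above.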

R: Convert monthly data into daily data for panel data

I have the following data:
5 Products with a monthly rating from 2018-08 to 2018-12
Now, with the help of R, I would like to convert the monthly data into daily data so that I have panel data. The monthly rating for each product will also be the rating for each day in the respective month.
So, that the new data will look like:
(with the first column being the product, the second column the date and the third column the rating)
A 2018-08-01 1
A 2018-08-02 1
A 2018-08-03 1
A 2018-08-04 1
... so on
A 2018-09-01 1
A 2018-09-02 1
...so on
A 2018-12-31 1
B 2018-08-01 3
B 2018-08-02 3
... so on
E 2018-12-31 3

library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
# example data
data <- tribble(
~Product, ~`Product Rating 2018-08`, ~`Product Rating 2018-10`,
"A", 1, 1,
"B", 3, 3,
)
data2 <-
data %>%
pivot_longer(-Product) %>%
mutate(
name = name %>% str_extract("[0-9-]+$") %>% paste0("-01") %>% as.Date()
)
seq(as.Date("2018-08-01"), as.Date("2018-12-31"), by = "days") %>%
tibble(date = .) %>%
# left join on year and month
expand_grid(data2) %>%
filter(month(date) == month(name) & year(date) == year(name)) %>%
select(Product, date, value)
#> # A tibble: 124 × 3
#> Product date value
#> <chr> <date> <dbl>
#> 1 A 2018-08-01 1
#> 2 B 2018-08-01 3
#> 3 A 2018-08-02 1
#> 4 B 2018-08-02 3
#> 5 A 2018-08-03 1
#> 6 B 2018-08-03 3
#> 7 A 2018-08-04 1
#> 8 B 2018-08-04 3
#> 9 A 2018-08-05 1
#> 10 B 2018-08-05 3
#> # … with 114 more rows
Created on 2022-03-09 by the reprex package (v2.0.0)
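
An alternative sketch that avoids the full cross join: expand each monthly row into its own run of days and unnest. It assumes the monthly data is already in long form with one row per product and month; the column names month_start and rating and the sample values are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-form input: one row per product and month.
monthly <- data.frame(
  Product = c("A", "A", "B", "B"),
  month_start = as.Date(c("2018-08-01", "2018-09-01", "2018-08-01", "2018-09-01")),
  rating = c(1, 1, 3, 3)
)

daily <- monthly %>%
  rowwise() %>%
  mutate(
    # First day of the next month minus one day is the last day of this month.
    date = list(seq(month_start,
                    seq(month_start, by = "month", length.out = 2)[2] - 1,
                    by = "day"))
  ) %>%
  ungroup() %>%
  unnest(date) %>%
  select(Product, date, rating)
```

Each product keeps its monthly rating on every day of that month, and months of different lengths are handled by the seq() call rather than by a filter.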

Tally if observations fall in date windows

I have a data frame that represents policies with start and end dates. I'm trying to tally the count of policies that are active each month.
library(tidyverse)
ayear <- 2021
amonth <- 10
months <- 12
df <- tibble(
pol = c(1, 2, 3, 4)
, bdate = c('2021-02-23', '2019-12-03', '2020-08-11', '2020-12-14')
, edate = c('2022-02-23', '2020-12-03', '2021-08-11', '2021-06-14')
)
These four policies have a begin date (bdate) and an end date (edate). Starting in October (amonth) 2021 (ayear) and going back 12 months (months), I'm trying to count how many of the 4 policies were active at some point in each month.
The data frame I'm trying to generate would have 12 rows and three columns: month, year, and active_pol_count.

library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
df <- tibble(
pol = c(1, 2, 3, 4),
bdate = c("2021-02-23", "2019-12-03", "2020-08-11", "2020-12-14"),
edate = c("2022-02-23", "2020-12-03", "2021-08-11", "2021-06-14")
)
# transform start and end date to interval
df <- mutate(df, interval = interval(bdate, edate))
# for the first day of each month from 2020-10 to 2021-09
seq(as.Date("2020-10-01"), as.Date("2021-09-01"), by = "months") %>%
tibble(date = .) %>%
mutate(
year = year(date),
month = month(date),
active_pol_count = date %>% map_dbl(~ .x %within% df$interval %>% sum()),
)
#> # A tibble: 12 x 4
#> date year month active_pol_count
#> <date> <dbl> <dbl> <dbl>
#> 1 2020-10-01 2020 10 2
#> 2 2020-11-01 2020 11 2
#> 3 2020-12-01 2020 12 2
#> 4 2021-01-01 2021 1 2
#> 5 2021-02-01 2021 2 2
#> 6 2021-03-01 2021 3 3
#> 7 2021-04-01 2021 4 3
#> 8 2021-05-01 2021 5 3
#> 9 2021-06-01 2021 6 3
#> 10 2021-07-01 2021 7 2
#> 11 2021-08-01 2021 8 2
#> 12 2021-09-01 2021 9 1
Created on 2021-12-13 by the reprex package (v2.0.1)
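
The answer above counts a policy only if it is active on the first day of the month. If "active at some point in the month" is meant literally, an interval-overlap test is one option: a policy overlaps a month when it begins on or before the month's last day and ends on or after its first day. The base R sketch below uses the same 12-month window as the answer (2020-10 to 2021-09), which is an assumption about the intended range; the object names are hypothetical.

```r
pol <- data.frame(
  pol = c(1, 2, 3, 4),
  bdate = as.Date(c("2021-02-23", "2019-12-03", "2020-08-11", "2020-12-14")),
  edate = as.Date(c("2022-02-23", "2020-12-03", "2021-08-11", "2021-06-14"))
)

# First and last day of each of the 12 months in the window.
month_start <- seq(as.Date("2020-10-01"), by = "month", length.out = 12)
month_end <- seq(as.Date("2020-11-01"), by = "month", length.out = 12) - 1

# Count policies whose [bdate, edate] range overlaps each month.
active <- vapply(
  seq_along(month_start),
  function(i) sum(pol$bdate <= month_end[i] & pol$edate >= month_start[i]),
  integer(1)
)

active_by_month <- data.frame(
  month = as.integer(format(month_start, "%m")),
  year = as.integer(format(month_start, "%Y")),
  active_pol_count = active
)
```

With this definition December 2020 and February 2021 count three policies instead of two, because policies that start or end mid-month are included.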
