R: Identify first, second, and third dates of multiple columns and rows

I have a dataset where individuals can have multiple rows of data and seven columns where dates have been listed. I'm trying to find the first, second, and third earliest dates.
> head(Addim_try)
# A tibble: 6 x 8
ID HistoryDate1 HistoryDate2 HistoryDate3 HistoryDate4 Date1 Date2 PDate1
<dbl> <dttm> <dttm> <dttm> <dttm> <dttm> <dttm> <dttm>
1 1317051 NA NA NA NA NA NA 2022-04-05 00:00:00
2 1317051 2021-06-19 00:00:00 2021-07-10 00:00:00 NA NA NA NA NA
3 1317079 2021-08-10 00:00:00 2021-08-31 00:00:00 NA NA NA NA NA
4 1317079 2022-01-21 00:00:00 NA NA NA NA NA NA
5 1324163 2022-04-08 00:00:00 NA NA NA NA NA NA
6 1324163 2021-08-07 00:00:00 2021-10-09 00:00:00 NA NA NA NA NA
1 1279491 2021-06-14 00:00:00 2021-07-12 00:00:00 NA NA NA NA NA
I'm considering first identifying the first, second, and third dose by row (I would show this but suddenly my code isn't working) and then re-shaping my data to wide format (although I'm getting stuck here, so any suggestions on how to re-shape would be helpful)... Any better ideas?

We could reshape to 'long' format with pivot_longer and take the three earliest dates per original row with slice_min:
library(dplyr)
library(tidyr)
Addim_try %>%
mutate(rn = row_number()) %>%
pivot_longer(cols = -c(rn, ID), values_to = "Date", values_drop_na = TRUE) %>%
group_by(rn) %>%
slice_min(n = 3, order_by = Date) %>%
ungroup %>%
select(-rn)
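If the goal is the three earliest dates per individual rather than per row, the same idea can be grouped by ID and then spread back to wide format. This is only a sketch under assumptions: the toy tibble, the `source` and `dose` column names, and the `dose_` prefix are made up for illustration, and `with_ties = FALSE` is added so that tied dates cannot return more than three rows.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data shaped like Addim_try (fewer columns, Date instead of dttm)
toy <- tibble(
  ID = c(1, 1, 2),
  HistoryDate1 = as.Date(c(NA, "2021-06-19", "2021-08-10")),
  HistoryDate2 = as.Date(c(NA, "2021-07-10", "2021-08-31")),
  PDate1 = as.Date(c("2022-04-05", NA, NA))
)

first_three <- toy %>%
  # stack all date columns into one, dropping the missing dates
  pivot_longer(cols = -ID, names_to = "source", values_to = "Date",
               values_drop_na = TRUE) %>%
  group_by(ID) %>%
  # keep at most the three earliest dates per individual
  slice_min(order_by = Date, n = 3, with_ties = FALSE) %>%
  arrange(Date, .by_group = TRUE) %>%
  mutate(dose = row_number()) %>%
  ungroup() %>%
  select(-source) %>%
  # one row per ID, one column per rank (dose_1, dose_2, dose_3)
  pivot_wider(names_from = dose, values_from = Date, names_prefix = "dose_")

print(first_three)
```

An individual with fewer than three dates simply gets NA in the remaining columns.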

Related

Why does grepl work but not str_detect for mutate depending on row value?

I have been trying to wrap my head around this.
I need to create a corrected column based on detecting a specific comment at another "error" column in my database. I can work around this with grepl, but I am struggling with getting str_detect to work as well (it is usually faster for big datasets).
Here is an example database:
test <- tibble(
id = 1:30,
date = sample(seq(as.Date('2000/01/01'), as.Date('2018/01/01'), by="day"), 30),
error = c(rep(NA, 3), "wrong date! Correct date = 01.03.2022",
rep(NA, 5), "wrong date! Correct date = 01.05.2021",
rep(NA, 5), "wrong date! Correct date = 01.03.2022",
rep(NA, 7), "wrong date! Correct date = 01.05.2021",
rep(NA, 2), "date already corrected on 01.05.2021",
NA, "date already corrected on 01.03.2022", NA))
I first tried to create a new "date_corr" column with str_detect:
test %>%
mutate(date_corr=if_else(str_detect(error, "date \\= 01\\.03\\.2022$"), as.Date('2022/03/01'), date),
date_corr=if_else(str_detect(error, "date \\= 01\\.05\\.2021$"), as.Date('2021/05/01'), date_corr))
This yields:
A tibble: 30 × 4
id date error date_corr
<int> <date> <chr> <date>
1 1 2010-04-28 NA NA
2 2 2004-06-30 NA NA
3 3 2015-09-25 NA NA
4 4 2005-08-21 wrong date! Correct date = 01.03.2022 2022-03-01
5 5 2008-07-16 NA NA
6 6 2004-08-02 NA NA
7 7 2001-10-15 NA NA
8 8 2007-07-21 NA NA
9 9 2014-04-19 NA NA
10 10 2013-02-08 wrong date! Correct date = 01.05.2021 2021-05-01
# … with 20 more rows
Adding rowwise makes no difference:
test %>%
rowwise() %>%
mutate(date_corr=if_else(str_detect(error, "date \\= 01\\.03\\.2022$"), as.Date('2022/03/01'), date),
date_corr=if_else(str_detect(error, "date \\= 01\\.05\\.2021$"), as.Date('2021/05/01'), date_corr))
A tibble: 30 × 4
# Rowwise:
id date error date_corr
<int> <date> <chr> <date>
1 1 2010-04-28 NA NA
2 2 2004-06-30 NA NA
3 3 2015-09-25 NA NA
4 4 2005-08-21 wrong date! Correct date = 01.03.2022 2022-03-01
5 5 2008-07-16 NA NA
6 6 2004-08-02 NA NA
7 7 2001-10-15 NA NA
8 8 2007-07-21 NA NA
9 9 2014-04-19 NA NA
10 10 2013-02-08 wrong date! Correct date = 01.05.2021 2021-05-01
# … with 20 more rows
However, with grepl I get the desired outcome, regardless of rowwise:
test %>%
mutate(date_corr=if_else(grepl("date \\= 01\\.03\\.2022$", error), as.Date('2022/03/01'), date),
date_corr=if_else(grepl("date \\= 01\\.05\\.2021$", error), as.Date('2021/05/01'), date_corr))
# A tibble: 30 × 4
id date error date_corr
<int> <date> <chr> <date>
1 1 2010-04-28 NA 2010-04-28
2 2 2004-06-30 NA 2004-06-30
3 3 2015-09-25 NA 2015-09-25
4 4 2005-08-21 wrong date! Correct date = 01.03.2022 2022-03-01
5 5 2008-07-16 NA 2008-07-16
6 6 2004-08-02 NA 2004-08-02
7 7 2001-10-15 NA 2001-10-15
8 8 2007-07-21 NA 2007-07-21
9 9 2014-04-19 NA 2014-04-19
10 10 2013-02-08 wrong date! Correct date = 01.05.2021 2021-05-01
# … with 20 more rows
test %>%
rowwise() %>%
mutate(date_corr=if_else(grepl("date \\= 01\\.03\\.2022$", error), as.Date('2022/03/01'), date),
date_corr=if_else(grepl("date \\= 01\\.05\\.2021$", error), as.Date('2021/05/01'), date_corr))
A tibble: 30 × 4
# Rowwise:
id date error date_corr
<int> <date> <chr> <date>
1 1 2010-04-28 NA 2010-04-28
2 2 2004-06-30 NA 2004-06-30
3 3 2015-09-25 NA 2015-09-25
4 4 2005-08-21 wrong date! Correct date = 01.03.2022 2022-03-01
5 5 2008-07-16 NA 2008-07-16
6 6 2004-08-02 NA 2004-08-02
7 7 2001-10-15 NA 2001-10-15
8 8 2007-07-21 NA 2007-07-21
9 9 2014-04-19 NA 2014-04-19
10 10 2013-02-08 wrong date! Correct date = 01.05.2021 2021-05-01
# … with 20 more rows
What am I missing here?
The difference is how they handle NA values
str_detect(NA, "missing")
# [1] NA
grepl("missing", NA)
# [1] FALSE
And note that if you have an NA value in the condition for if_else, it will also preserve the NA value
if_else(NA, 1, 2)
# [1] NA
So str_detect preserves the NA value, and if_else passes it through to the result. It's not clear what the "right" value should be, but if you want str_detect to give the same result as grepl, you can be explicit about leaving rows with an NA error unchanged:
test %>%
mutate(date_corr=if_else(!is.na(error) & str_detect(error, "date \\= 01\\.03\\.2022$"), as.Date('2022/03/01'), date),
date_corr=if_else(!is.na(error) & str_detect(error, "date \\= 01\\.05\\.2021$"), as.Date('2021/05/01'), date_corr))
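Another option, offered here only as a sketch, is case_when: a condition that evaluates to NA is treated as not matched, so rows with an NA error fall through to the default branch. The three-row tibble below is a cut-down stand-in for the question's data.

```r
library(dplyr)
library(stringr)

# Cut-down stand-in for the question's `test` data
test <- tibble(
  date = as.Date(c("2010-04-28", "2005-08-21", "2013-02-08")),
  error = c(NA,
            "wrong date! Correct date = 01.03.2022",
            "wrong date! Correct date = 01.05.2021")
)

fixed <- test %>%
  mutate(date_corr = case_when(
    # an NA from str_detect counts as "no match" and moves to the next branch
    str_detect(error, "date = 01\\.03\\.2022$") ~ as.Date("2022-03-01"),
    str_detect(error, "date = 01\\.05\\.2021$") ~ as.Date("2021-05-01"),
    TRUE ~ date
  ))

print(fixed)
```

This also collapses the two chained if_else calls into a single expression.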

How can I convert logical columns to numeric in R using dplyr?

I have a data frame that I import from an Excel (.xlsx) file; it looks like this:
>data
# A tibble: 3,338 x 4
Dates A B C
<dttm> <lgl> <lgl> <lgl>
1 2009-01-05 00:00:00 NA NA NA
2 2009-01-06 00:00:00 NA NA NA
3 2009-01-07 00:00:00 NA NA NA
4 2009-01-08 00:00:00 NA NA NA
5 2009-01-09 00:00:00 NA NA NA
6 2009-01-12 00:00:00 NA NA NA
7 2009-01-13 00:00:00 NA NA NA
8 2009-01-14 00:00:00 NA NA NA
9 2009-01-15 00:00:00 NA NA NA
10 2009-01-16 00:00:00 NA NA NA
# ... with 3,328 more rows
# i Use `print(n = ...)` to see more rows
The problem is that the three columns A, B, and C do contain numeric values, but only after some three thousand rows.
To convert them to numeric, I tried:
data%>%
dplyr::mutate(date = as.Date(Dates))%>%
dplyr::select(-Dates)%>%
dplyr::relocate(date,.before="A")%>%
dplyr::mutate_if(is.logical, as.numeric)%>%
tidyr::pivot_longer(!date,names_to = "var", values_to = "y")%>%
dplyr::group_by(var)%>%
dplyr::arrange(var)%>%
tidyr::drop_na()
But the problem remains:
date var y
<date> <chr> <dbl>
1 2021-11-30 A 1
2 2021-12-01 A 1
3 2021-12-02 A 1
4 2021-12-03 A 1
5 2021-12-06 A 1
6 2021-12-07 A 1
7 2021-12-08 A 1
8 2021-12-09 A 1
9 2021-12-10 A 1
10 2021-12-13 A 1
# ... with 189 more rows
Any help?
Summing up from comments:
It's usually easier to fix conversion errors as close to the original source as possible.
read_xlsx tries to guess column types by checking the first guess_max rows, guess_max being a read_xlsx parameter with a default value of min(1000, n_max). If read_xlsx gets the column types wrong because those columns are filled with NA for the first 1000 rows, simply increasing the guess_max parameter might be a viable solution:
readxl::read_xlsx("file.xlsx", guess_max = 5000)
Though for a simple 4-column dataset one shouldn't need more than defining correct column types manually:
readxl::read_xlsx("file.xlsx", col_types = c("date", "numeric", "numeric", "numeric"))
If NA values are only at the beginning of some columns, changing sorting order in Excel itself and moving NAs from top before importing the file into R should also work.
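As for why the mutate_if(is.logical, as.numeric) attempt in the question only ever produces 1, here is a small base R illustration. It assumes the numbers were coerced to logical on import in the same way as.logical would coerce them; the sample values are made up.

```r
# Made-up values standing in for cells of column A
raw_values <- c(NA, 3.7, 12, 0)

# Once a column has been typed as logical, every non-zero number is just TRUE
as_logical <- as.logical(raw_values)

# Converting back cannot recover the original numbers, only 0/1
recovered <- as.numeric(as_logical)

print(as_logical)
print(recovered)
```

Under that assumption the original values are already lost by the time the data frame exists, which is why the fix has to happen at import time.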

R inserting rows between dates by group based on second column

I have a df that looks like this
ID FINAL_DT START_DT
23 NA 2020-03-20
25 NA 2020-04-10
29 2020-02-02 2020-01-23
30 NA 2020-01-02
What I would like to do is, for each ID, add a row for every month starting from START_DT and ending at whichever comes first, FINAL_DT or the current date. The expected output would be the following:
ID FINAL_DT START_DT ACTIVE_MONTH
23 NA 2020-03-20 2020-03
23 NA NA 2020-04
23 NA NA 2020-05
25 NA 2020-04-10 2020-04
25 NA NA 2020-05
29 2020-02-02 2020-01-23 2020-01
29 2020-02-02 NA 2020-02
30 NA 2020-01-02 2020-01
30 NA NA 2020-02
30 NA NA 2020-03
30 NA NA 2020-04
30 NA NA 2020-05
I have the following code, which works but does not account for FINAL_DT:
current_date = as.Date(Sys.Date())
enroll <- enroll %>%
group_by(ID) %>%
complete(START_DT = seq(START_DT, current_date, by = "month"))
I have tried the following but get an error, I believe because of the NAs:
current_date = as.Date(Sys.Date())
enroll <- enroll %>%
group_by(ID) %>%
complete(START_DT = seq(START_DT, min(FINAL_DT, current_date), by = "month"))
The day of the month does not matter; I am not sure whether it would be easier to drop it before or after.
Here is another approach. You can use floor_date to get the first day of the month to use in your sequence of months. Then, you can include the full sequence to today's date, and filter based on FINAL_DT. You can use as.yearmon from zoo if you'd like a month/year object for month.
library(zoo)
library(tidyr)
library(dplyr)
library(lubridate)
current_date = as.Date(Sys.Date())
enroll %>%
mutate(ACTIVE_MONTH = floor_date(START_DT, unit = "month")) %>%
group_by(ID) %>%
complete(ACTIVE_MONTH = seq.Date(floor_date(START_DT, unit = "month"), current_date, by = "month")) %>%
filter(ACTIVE_MONTH <= first(FINAL_DT) | is.na(first(FINAL_DT))) %>%
ungroup() %>%
mutate(ACTIVE_MONTH = as.yearmon(ACTIVE_MONTH))
Output
# A tibble: 12 x 4
ID ACTIVE_MONTH FINAL_DT START_DT
<dbl> <yearmon> <date> <date>
1 23 Mar 2020 NA 2020-03-20
2 23 Apr 2020 NA NA
3 23 May 2020 NA NA
4 25 Apr 2020 NA 2020-04-10
5 25 May 2020 NA NA
6 29 Jan 2020 2020-02-02 2020-01-23
7 29 Feb 2020 NA NA
8 30 Jan 2020 NA 2020-01-02
9 30 Feb 2020 NA NA
10 30 Mar 2020 NA NA
11 30 Apr 2020 NA NA
12 30 May 2020 NA NA
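Here is an approach that returns one row for each MONTH with the help of lubridate. Note that it works on bare month numbers, so it only holds while all dates fall within the same calendar year.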
library(dplyr)
library(tidyr)
library(lubridate)
current_date = as.Date(Sys.Date())
enroll %>%
mutate(MONTH = month(START_DT)) %>%
group_by(ID) %>%
complete(MONTH = seq(MONTH, min(month(FINAL_DT)[!is.na(FINAL_DT)],month(current_date))))
# A tibble: 12 x 4
# Groups: ID [4]
# ID MONTH FINAL_DT START_DT
# <int> <dbl> <fct> <fct>
# 1 23 3 NA 2020-03-20
# 2 23 4 NA NA
# 3 23 5 NA NA
# 4 25 4 NA 2020-04-10
# 5 25 5 NA NA
# 6 29 1 2020-02-02 2020-01-23
# 7 29 2 NA NA
# 8 30 1 NA 2020-01-02
# 9 30 2 NA NA
#10 30 3 NA NA
#11 30 4 NA NA
#12 30 5 NA NA
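A further variation, sketched here under assumptions, resolves the end date per row first with pmin(..., na.rm = TRUE), so that an NA FINAL_DT falls back to the current date, and only then builds the month sequence. The helper `first_of_month`, the `END_DT` column, and the fixed `current_date` are introduced purely for this illustration.

```r
library(dplyr)
library(tidyr)

enroll <- tibble(
  ID = c(23, 25, 29, 30),
  FINAL_DT = as.Date(c(NA, NA, "2020-02-02", NA)),
  START_DT = as.Date(c("2020-03-20", "2020-04-10", "2020-01-23", "2020-01-02"))
)

# Fixed here so the result is reproducible; normally Sys.Date()
current_date <- as.Date("2020-05-15")

# Hypothetical helper: truncate a date to the first day of its month
first_of_month <- function(x) as.Date(format(x, "%Y-%m-01"))

active <- enroll %>%
  # whichever comes first: FINAL_DT or today; NA FINAL_DT is ignored
  mutate(END_DT = pmin(FINAL_DT, current_date, na.rm = TRUE)) %>%
  rowwise() %>%
  # one month-start date per active month, stored as a list column
  mutate(ACTIVE_MONTH = list(seq(first_of_month(START_DT),
                                 first_of_month(END_DT),
                                 by = "month"))) %>%
  ungroup() %>%
  unnest(ACTIVE_MONTH) %>%
  mutate(ACTIVE_MONTH = format(ACTIVE_MONTH, "%Y-%m")) %>%
  select(-END_DT)

print(active)
```

Unlike the expected output in the question, this keeps START_DT and FINAL_DT filled in on every generated row; blank them afterwards if the NA layout is required.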

How to create month-end date series using complete function?

Here is my toy dataset:
df <- tibble::tribble(
~date, ~value,
"2007-01-31", 25,
"2007-05-31", 31,
"2007-12-31", 26
)
I am creating month-end date series using the following code.
df %>%
mutate(date = as.Date(date)) %>%
complete(date = seq(as.Date("2007-01-31"), as.Date("2019-12-31"), by="month"))
However, I am not getting the correct month-end dates.
date value
<date> <dbl>
1 2007-01-31 25
2 2007-03-03 NA
3 2007-03-31 NA
4 2007-05-01 NA
5 2007-05-31 31
6 2007-07-01 NA
7 2007-07-31 NA
8 2007-08-31 NA
9 2007-10-01 NA
10 2007-10-31 NA
11 2007-12-01 NA
12 2007-12-31 26
What am I missing here? I am okay using other functions from any other package.
No need of complete function, you can do this in base R.
Since last day of the month is different for different months, we can create a sequence of monthly start dates and subtract 1 day from it.
seq(as.Date("2007-02-01"), as.Date("2008-01-01"), by="month") - 1
#[1] "2007-01-31" "2007-02-28" "2007-03-31" "2007-04-30" "2007-05-31" "2007-06-30"
# "2007-07-31" "2007-08-31" "2007-09-30" "2007-10-31" "2007-11-30" "2007-12-31"
Applying the same logic to the data frame from the question, we can do:
library(dplyr)
df %>%
mutate(date = as.Date(date)) %>%
tidyr::complete(date = seq(min(date) + 1, max(date) + 1, by="month") - 1)
# date value
# <date> <dbl>
# 1 2007-01-31 25
# 2 2007-02-28 NA
# 3 2007-03-31 NA
# 4 2007-04-30 NA
# 5 2007-05-31 31
# 6 2007-06-30 NA
# 7 2007-07-31 NA
# 8 2007-08-31 NA
# 9 2007-09-30 NA
#10 2007-10-31 NA
#11 2007-11-30 NA
#12 2007-12-31 26
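The min(date) + 1 trick above relies on the first and last dates already being month ends. As a rough sketch for inputs that may fall anywhere in a month, the same "next month start minus one day" idea can be wrapped in a small helper; `month_ends` is a hypothetical name, not a function from any package.

```r
# Hypothetical helper: last day of every month between `from` and `to` (inclusive)
month_ends <- function(from, to) {
  first_from <- as.Date(format(as.Date(from), "%Y-%m-01"))
  first_to   <- as.Date(format(as.Date(to), "%Y-%m-01"))
  n_months   <- length(seq(first_from, first_to, by = "month"))
  # one extra month start, so every month has a following month to step back from
  starts <- seq(first_from, by = "month", length.out = n_months + 1)
  starts[-1] - 1
}

ends <- month_ends("2007-01-31", "2007-12-31")
print(ends)

# Leap years are handled by the calendar arithmetic itself
print(month_ends("2008-02-10", "2008-02-20"))
```

The result can be passed straight to tidyr::complete(date = ...) in place of the inline seq() call.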

Spread and Gather table return duplicated rows with NA values

I have a table with categories and sub categories encoded in this format of columns name:
Date| Admissions__0 |Attendance__0 |Tri_1__0|Tri_2__0|...
Tri_1__1|Tri_2__1|...|
and I would like to change it to this format of columns using spread and gather function of tidyverse:
Date| Country code| Admissions| Attendance| Tri_1|Tri_2|...
I tried a solution that was posted, but the outcome returns multiple rows padded with NA rather than a single row.
My code used:
temp <- data %>% gather(key="columns",value ="dt",-Date)
temp <- temp %>% mutate(category = gsub(".*__","",columns)) %>% mutate(columns = gsub("__\\d","",columns))
temp %>% mutate(row = row_number()) %>% spread(key="columns",value="dt")
And my result is:
Date country_code row admissions attendance Tri_1 Tri_2 Tri_3 Tri_4 Tri_5
<chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 01-APR-2014 0 275 NA 209 NA NA NA NA NA
2 01-APR-2014 0 640 84 NA NA NA NA NA NA
3 01-APR-2014 0 1005 NA NA 5 NA NA NA NA
4 01-APR-2014 0 1370 NA NA NA 33 NA NA NA
5 01-APR-2014 0 1735 NA NA NA NA 62 NA NA
6 01-APR-2014 0 2100 NA NA NA NA NA 80 NA
7 01-APR-2014 0 2465 NA NA NA NA NA NA 29
8 01-APR-2014 1 2830 NA 138 NA NA NA NA NA
9 01-APR-2014 1 3195 66 NA NA NA NA NA NA
10 01-APR-2014 1 3560 NA NA N/A NA NA NA NA
My expected results:
Date country_code row admissions attendance Tri_1 Tri_2 Tri_3 Tri_4 Tri_5
<chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 01-APR-2014 0 275 84 209 5 33 62 80 29
8 01-APR-2014 1 2830 66 138 66 ... ... ... ...
We can use summarise_at with coalesce to collapse the NA elements after the spread:
library(tidyverse)
data %>%
gather(key = "columns", value = "dt", -Date, na.rm = TRUE) %>%
mutate(category = gsub(".*__","",columns)) %>%
mutate(columns = gsub("__\\d","",columns)) %>%
group_by(Date, dt, columns, category) %>%
mutate(rn = row_number()) %>%
spread(columns, dt) %>%
select(-V1) %>%
summarise_at(vars(Admissions:Tri_5),list(~ coalesce(!!! .))) # %>%
# filter if needed
#filter_at(vars(Admissions:Tri_5), all_vars(!is.na(.)))
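With a tidyr version that provides pivot_longer, the same reshaping can probably be done in one step, without the intermediate NA rows. The sketch below swaps gather/spread for pivot_longer with the special ".value" name, splitting each column name at the double underscore. The toy tibble is an assumption: it has only three measures, and the Tri_1__1 value is invented for illustration.

```r
library(tidyr)
library(tibble)

# Toy data in the question's naming scheme; Tri_1__1 = 12 is an invented value
data <- tibble(
  Date = "01-APR-2014",
  Admissions__0 = 84, Attendance__0 = 209, Tri_1__0 = 5,
  Admissions__1 = 66, Attendance__1 = 138, Tri_1__1 = 12
)

tidy <- pivot_longer(
  data,
  cols = -Date,
  # the part before "__" becomes the output column, the part after it the country code
  names_to = c(".value", "country_code"),
  names_sep = "__"
)

print(tidy)
```

Each Date and country code combination ends up on a single row, so no coalesce step is needed.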
