Pad within grouped dates in R

library(tidyverse)
library(lubridate)
library(padr)
df <- tibble(`Action Item ID` = c("ABC", "DEF", "GHI", "JKL", "MNO", "PQR"),
             `Date Created` = as.Date(c("2019-01-01", "2019-01-01",
                                        "2019-06-01", "2019-06-01",
                                        "2019-08-01", "2019-08-01")),
             `Date Closed` = as.Date(c("2019-01-15", "2019-05-31",
                                       "2019-06-15", "2019-07-05",
                                       "2019-08-15", NA)),
             `Current Status` = c(rep("Closed", 5), "Open")) %>%
  pivot_longer(-c(`Action Item ID`, `Current Status`),
               names_to = "Type",
               values_to = "Date")
#> # A tibble: 12 x 4
#>    `Action Item ID` `Current Status` Type         Date
#>    <chr>            <chr>            <chr>        <date>
#>  1 ABC              Closed           Date Created 2019-01-01
#>  2 ABC              Closed           Date Closed  2019-01-15
#>  3 DEF              Closed           Date Created 2019-01-01
#>  4 DEF              Closed           Date Closed  2019-05-31
#>  5 GHI              Closed           Date Created 2019-06-01
#>  6 GHI              Closed           Date Closed  2019-06-15
#>  7 JKL              Closed           Date Created 2019-06-01
#>  8 JKL              Closed           Date Closed  2019-07-05
#>  9 MNO              Closed           Date Created 2019-08-01
#> 10 MNO              Closed           Date Closed  2019-08-15
#> 11 PQR              Open             Date Created 2019-08-01
#> 12 PQR              Open             Date Closed  NA
I've got my data frame above and I'm trying to pad dates within each group with the padr R package.
df %>% group_by(`Action Item ID`) %>% pad()
#> Error: Not all grouping variables are column names of x.
The error doesn't make much sense to me. I'm looking for output that would look like the following:
#> # A tibble: ? x 4
#>    `Action Item ID` `Current Status` Type         Date
#>    <chr>            <chr>            <chr>        <date>
#>    ABC              Closed           Date Created 2019-01-01
#>    ABC              NA               NA           2019-01-02
#>    ABC              NA               NA           2019-01-03
#>    ...              ...              ...          ...
#>    ABC              Closed           Date Closed  2019-01-15
#>    DEF              Closed           Date Created 2019-01-01
#>    DEF              NA               NA           2019-01-02
#>    ...              ...              ...          ...
#>    DEF              NA               NA           2019-05-30
#>    DEF              Closed           Date Closed  2019-05-31
#>    GHI              Closed           Date Created 2019-06-01
#>    ...              ...              ...          ...
Anybody have any idea what went wrong?

According to ?pad, there is a group argument:
group - Optional character vector that specifies the grouping variable(s). Padding will take place within the different groups. When interval is not specified, it will be determined applying get_interval on the datetime variable as a whole, ignoring groups (see last example).
In other words, pad() appears to take its grouping through this argument rather than picking it up from a preceding group_by(), which is presumably why the grouped call errors. So it is better to make use of that parameter:
library(dplyr)
library(padr)

df %>%
  pad(group = "Action Item ID")
# A tibble: 233 x 4
#    `Action Item ID` `Current Status` Type         Date
#    <chr>            <chr>            <chr>        <date>
#  1 ABC              Closed           Date Created 2019-01-01
#  2 ABC              <NA>             <NA>         2019-01-02
#  3 ABC              <NA>             <NA>         2019-01-03
#  4 ABC              <NA>             <NA>         2019-01-04
#  5 ABC              <NA>             <NA>         2019-01-05
#  6 ABC              <NA>             <NA>         2019-01-06
#  7 ABC              <NA>             <NA>         2019-01-07
#  8 ABC              <NA>             <NA>         2019-01-08
#  9 ABC              <NA>             <NA>         2019-01-09
# 10 ABC              <NA>             <NA>         2019-01-10
# … with 223 more rows
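If you would rather stay in the tidyverse, a similar padding can be sketched with tidyr::complete(), which does respect group_by(). This is an alternative I'm suggesting rather than part of the original answer, so treat it as a sketch:

library(dplyr)
library(tidyr)

df %>%
  group_by(`Action Item ID`) %>%
  complete(Date = seq(min(Date, na.rm = TRUE),  # pad from each group's first date...
                      max(Date, na.rm = TRUE),  # ...to its last known date
                      by = "day")) %>%
  ungroup()

The non-date columns are filled with NA for the padded rows, just as with pad().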

Related

Converting variable with 5 digit numbers and dates into date values

I have the following data, which contains some date values stored as 5-digit character serials. When I try to convert the column to dates, the values that are already proper dates become NA.
dt <- data.frame(id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
                 Registrationdate = c('2019-01-09', '2019-01-09', '2019-01-09',
                                      '2019-01-09', '2019-01-09', '2019-01-09',
                                      "44105", "44105", "44105", "44105", "44105"))
Expected output:
   id Registrationdate
1   1       2019-01-09
2   1       2019-01-09
3   1       2019-01-09
4   1       2019-01-09
5   1       2019-01-09
6   1       2019-01-09
7   2       2020-10-01
8   2       2020-10-01
9   2       2020-10-01
10  2       2020-10-01
11  2       2020-10-01
I tried using
library(openxlsx)
dt$Registrationdate <- convertToDate(dt$Registrationdate, origin = "1900-01-01")
But I got
   id Registrationdate
1   1             <NA>
2   1             <NA>
3   1             <NA>
4   1             <NA>
5   1             <NA>
6   1             <NA>
7   2       2020-10-01
8   2       2020-10-01
9   2       2020-10-01
10  2       2020-10-01
11  2       2020-10-01
Here's one approach using a mix of dplyr and base R:
library(dplyr, warn = FALSE)

dt |>
  mutate(Registrationdate = if_else(grepl("-", Registrationdate),
                                    as.Date(Registrationdate),
                                    openxlsx::convertToDate(Registrationdate,
                                                            origin = "1900-01-01")))
#> Warning in openxlsx::convertToDate(Registrationdate, origin = "1900-01-01"): NAs
#> introduced by coercion
#>    id Registrationdate
#> 1   1       2019-01-09
#> 2   1       2019-01-09
#> 3   1       2019-01-09
#> 4   1       2019-01-09
#> 5   1       2019-01-09
#> 6   1       2019-01-09
#> 7   2       2020-10-01
#> 8   2       2020-10-01
#> 9   2       2020-10-01
#> 10  2       2020-10-01
#> 11  2       2020-10-01
Created on 2022-10-15 with reprex v2.0.2
Alternatively, janitor's convert_to_date() handles mixed character dates and Excel serials in a single call:
library(janitor)
dt$Registrationdate <- convert_to_date(dt$Registrationdate)
   id Registrationdate
1   1       2019-01-09
2   1       2019-01-09
3   1       2019-01-09
4   1       2019-01-09
5   1       2019-01-09
6   1       2019-01-09
7   2       2020-10-01
8   2       2020-10-01
9   2       2020-10-01
10  2       2020-10-01
11  2       2020-10-01
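If you'd rather not pull in another package, a base R sketch is below. It assumes the serials use Excel's modern 1900 date system, which maps to an R origin of "1899-12-30" (two days before 1900-01-01, because Excel's serials start at 1 and treat 1900 as a leap year), and that the column is character (the R >= 4.0 data.frame default):

serial <- !grepl("-", dt$Registrationdate)   # the 5-digit Excel serials
out <- dt$Registrationdate
out[serial] <- as.character(as.Date(as.numeric(out[serial]), origin = "1899-12-30"))
dt$Registrationdate <- as.Date(out)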
Another option is to import the column in the expected format. An example with openxlsx2 is shown below. The top half creates a file that reproduces the behavior you see with openxlsx: some rows of the Registrationdate column are formatted as dates and some as strings, a fairly common error introduced by whoever generated the xlsx input.
With openxlsx2 you can define the type of column you want to import. The option was inspired by readxl (iirc).
library(openxlsx2)

## prepare data
date_as_string <- data.frame(
  id = rep(1, 6),
  Registrationdate = rep('2019-01-09', 6)
)
date_as_date <- data.frame(
  id = rep(2, 5),
  Registrationdate = rep(as.Date('2019-01-10'), 5)
)

options(openxlsx2.dateFormat = "yyyy-mm-dd")

wb <- wb_workbook()$
  add_worksheet()$
  add_data(x = date_as_string)$
  add_data(x = date_as_date, colNames = FALSE, startRow = 7)
# wb$open()

## read data as date
dt <- wb_to_df(wb, types = c(id = 1, Registrationdate = 2))

## check that Registrationdate is actually a Date column
str(dt$Registrationdate)
#> Date[1:10], format: "2019-01-09" "2019-01-09" "2019-01-09" "2019-01-09" "2019-01-09" ...

Dataframe with start & end date to daily data

I am trying to expand the data below to a daily basis, based on the range available in the start_date and end_date columns, and then sum the inventory for each day. (The input and expected output were originally posted as images.)
Please use dput() when posting data frames next time!
Example data:
# A tibble: 4 × 4
     id start      end        inventory
  <int> <chr>      <chr>          <dbl>
1     1 01/05/2022 02/05/2022       100
2     2 10/05/2022 15/05/2022        50
3     3 11/05/2022 21/05/2022        80
4     4 14/05/2022 17/05/2022        10
Transform the data:
df %>%
  mutate(across(2:3, ~ as.Date(.x, format = "%d/%m/%Y"))) %>%
  pivot_longer(cols = c(start, end), values_to = "date") %>%
  arrange(date) %>%
  select(date, inventory)
# A tibble: 8 × 2
  date       inventory
  <date>         <dbl>
1 2022-05-01       100
2 2022-05-02       100
3 2022-05-10        50
4 2022-05-11        80
5 2022-05-14        10
6 2022-05-15        50
7 2022-05-17        10
8 2022-05-21        80
Expand the dates and left_join:
left_join(tibble(date = seq(first(df$date),
                            last(df$date),
                            by = "day")),
          df)
# A tibble: 21 × 2
   date       inventory
   <date>         <dbl>
 1 2022-05-01       100
 2 2022-05-02       100
 3 2022-05-03        NA
 4 2022-05-04        NA
 5 2022-05-05        NA
 6 2022-05-06        NA
 7 2022-05-07        NA
 8 2022-05-08        NA
 9 2022-05-09        NA
10 2022-05-10        50
# … with 11 more rows
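The join above still leaves NA for the days between each start and end. If the goal really is a daily total across overlapping ranges (my reading of the expected output, which wasn't preserved here), one way to finish is to expand every row into its own day sequence and sum, sketched below:

library(dplyr)
library(tidyr)
library(purrr)

df %>%
  mutate(across(c(start, end), ~ as.Date(.x, format = "%d/%m/%Y")),
         date = map2(start, end, seq, by = "day")) %>%  # one date sequence per row
  unnest(date) %>%                                      # one row per id and day
  group_by(date) %>%
  summarise(inventory = sum(inventory), .groups = "drop")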

Using two summarise calls in R

library(lubridate)
library(tidyverse)

step_count_raw <- read_csv("data/step-count/step-count.csv",
                           locale = locale(tz = "Australia/Melbourne"))
location <- read_csv("data/step-count/location.csv")

step_count <- step_count_raw %>%
  rename_with(~ c("date_time", "date", "count")) %>%
  left_join(location) %>%
  mutate(location = replace_na(location, "Melbourne"))
step_count
#> # A tibble: 5,448 x 4
#>    date_time           date       count location
#>    <dttm>              <date>     <dbl> <chr>
#>  1 2019-01-01 09:00:00 2019-01-01   764 Melbourne
#>  2 2019-01-01 10:00:00 2019-01-01   913 Melbourne
#>  3 2019-01-02 00:00:00 2019-01-02     9 Melbourne
#>  4 2019-01-02 10:00:00 2019-01-02  2910 Melbourne
#>  5 2019-01-02 11:00:00 2019-01-02  1390 Melbourne
#>  6 2019-01-02 12:00:00 2019-01-02  1020 Melbourne
#>  7 2019-01-02 13:00:00 2019-01-02   472 Melbourne
#>  8 2019-01-02 15:00:00 2019-01-02  1220 Melbourne
#>  9 2019-01-02 16:00:00 2019-01-02  1670 Melbourne
#> 10 2019-01-02 17:00:00 2019-01-02  1390 Melbourne
#> # … with 5,438 more rows
I want to calculate the average daily step count for every location from step_count, ending up with a tibble called city_avg_steps.
Expected output:
#> # A tibble: 4 x 2
#>   location      avg_count
#>   <chr>             <dbl>
#> 1 Austin            7738.
#> 2 Denver           12738.
#> 3 Melbourne         7912.
#> 4 San Francisco    13990.
My code and output
city_avg_steps <- step_count %>%
  group_by(location) %>%
  summarise(avg_count = mean(count))
city_avg_steps
# A tibble: 4 x 2
  location      avg_count
  <chr>             <dbl>
1 Austin             721.
2 Denver             650.
3 Melbourne          530.
4 San Francisco      654.
My hunch is to calculate the daily totals first and then average them, using two summarise calls, but I'm not sure how to combine them.
As @dash2 explains in the comments, your desired output requires a two-stage aggregation: first aggregate the number of steps per day (adding them together with sum), then aggregate the different days into location-level averages with mean.
step_count %>%
  group_by(date, location) %>%
  summarise(sum_steps = sum(count, na.rm = TRUE)) %>%
  ungroup() %>%
  group_by(location) %>%
  summarise(avg_count = mean(sum_steps, na.rm = TRUE))
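If you have dplyr >= 1.1, the same two-stage aggregation can be sketched without group_by()/ungroup() by using the per-operation .by argument (assumption: a recent dplyr; older versions need the grouped form above):

step_count %>%
  summarise(sum_steps = sum(count, na.rm = TRUE), .by = c(date, location)) %>%  # daily totals
  summarise(avg_count = mean(sum_steps), .by = location)                        # location-level averages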

Select rows based on multiple conditions from two independent datasets

I have two independent datasets; one contains the event date. Each ID has only one Eventdate. As follows:
data1 <- data.frame(ID = c(1, 2, 3, 4, 5, 6),
                    Eventdate = c("2019-01-01", "2019-02-01", "2019-03-01",
                                  "2019-04-01", "2019-05-01", "2019-06-01"))
data1
  ID  Eventdate
1  1 2019-01-01
2  2 2019-02-01
3  3 2019-03-01
4  4 2019-04-01
5  5 2019-05-01
6  6 2019-06-01
In the other dataset, one ID can have multiple event names (Eventcode), each with its own event date (Eventdate). As follows:
data2 <- data.frame(ID = c(1, 1, 2, 3, 3, 3, 4, 4, 7),
                    Eventcode = c(201, 202, 201, 204, 205, 206, 209, 208, 203),
                    Eventdate = c("2019-01-01", "2019-01-01", "2019-02-11",
                                  "2019-02-15", "2019-03-01", "2019-03-15",
                                  "2019-03-10", "2019-03-20", "2019-06-02"))
data2
  ID Eventcode  Eventdate
1  1       201 2019-01-01
2  1       202 2019-01-01
3  2       201 2019-02-11
4  3       204 2019-02-15
5  3       205 2019-03-01
6  3       206 2019-03-15
7  4       209 2019-03-10
8  4       208 2019-03-20
9  7       203 2019-06-02
The two datasets are linked by ID, but the IDs do not fully overlap.
I would like to select cases in data2 with these conditions:
Match by ID.
Eventdate in data2 >= Eventdate in data1.
If one ID has multiple Eventdates in data2, select the earliest one.
If one ID has multiple Eventcodes on one Eventdate in data2, just randomly select one.
Then merge the selected rows of data2 into data1.
Expected results as follows:
data1
  ID  Eventdate Eventdate.data2 Eventcode
1  1 2019-01-01      2019-01-01       201
2  2 2019-02-01      2019-02-11       201
3  3 2019-03-01      2019-03-01       205
4  4 2019-04-01
5  5 2019-05-01
6  6 2019-06-01
or
data1
  ID  Eventdate Eventdate.data2 Eventcode
1  1 2019-01-01      2019-01-01       202
2  2 2019-02-01      2019-02-11       201
3  3 2019-03-01      2019-03-01       205
4  4 2019-04-01
5  5 2019-05-01
6  6 2019-06-01
Thank you very very much!
You can try this approach:
library(dplyr)

left_join(data1, data2, by = 'ID') %>%
  group_by(ID, Eventdate.x) %>%
  summarise(Eventdate = Eventdate.y[Eventdate.y >= Eventdate.x][1],
            Eventcode = {
              inds <- Eventdate.y >= Eventdate.x
              val <- sum(inds, na.rm = TRUE)
              if (val == 1) Eventcode[inds]
              else if (val > 1) sample(Eventcode[inds], 1)
              else NA_real_
            })
#      ID Eventdate.x Eventdate  Eventcode
#   <dbl> <chr>       <chr>          <dbl>
# 1     1 2019-01-01  2019-01-01       201
# 2     2 2019-02-01  2019-02-11       201
# 3     3 2019-03-01  2019-03-01       205
# 4     4 2019-04-01  NA                NA
# 5     5 2019-05-01  NA                NA
# 6     6 2019-06-01  NA                NA
The complicated logic for Eventcode is there to satisfy the randomness requirement; if you are OK selecting the first value (as is done for Eventdate), you can simplify it to:
left_join(data1, data2, by = 'ID') %>%
  group_by(ID, Eventdate.x) %>%
  summarise(Eventdate = Eventdate.y[Eventdate.y >= Eventdate.x][1],
            Eventcode = Eventcode[Eventdate.y >= Eventdate.x][1])
Does this work:
library(dplyr)

data1 %>%
  rename(Eventdate_dat1 = Eventdate) %>%
  left_join(data2, by = 'ID') %>%
  group_by(ID) %>%
  filter(Eventdate >= Eventdate_dat1) %>%
  mutate(Eventdate = case_when(length(unique(Eventdate)) > 1 ~ min(Eventdate),
                               TRUE ~ Eventdate),
         Eventcode = case_when(length(unique(Eventcode)) > 1 ~ min(Eventcode),
                               TRUE ~ Eventcode)) %>%
  distinct() %>%
  right_join(data1, by = 'ID') %>%
  select(ID, 'Eventdate' = Eventdate.y, 'Eventdate.data2' = Eventdate.x, Eventcode)
# A tibble: 6 x 4
# Groups:  ID [6]
     ID Eventdate  Eventdate.data2 Eventcode
  <dbl> <chr>      <chr>               <dbl>
1     1 2019-01-01 2019-01-01            201
2     2 2019-02-01 2019-02-11            201
3     3 2019-03-01 2019-03-01            205
4     4 2019-04-01 NA                     NA
5     5 2019-05-01 NA                     NA
6     6 2019-06-01 NA                     NA
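For what it's worth, here is another sketch of the same logic built around slice_min() (dplyr >= 1.0 assumed). Note that with_ties = FALSE breaks ties by taking the first row rather than sampling at random, so it satisfies the "pick one" requirement but not the randomness:

library(dplyr)

candidates <- data2 %>%
  rename(Eventdate.data2 = Eventdate) %>%
  inner_join(data1, by = "ID") %>%
  filter(Eventdate.data2 >= Eventdate) %>%                  # events on/after data1's date
  group_by(ID) %>%
  slice_min(Eventdate.data2, n = 1, with_ties = FALSE) %>%  # earliest remaining event per ID
  ungroup() %>%
  select(ID, Eventdate.data2, Eventcode)

left_join(data1, candidates, by = "ID")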

Merging rows in R while excluding certain data

Let's say I have a data frame with many subjects and many test variables:
  Name  Date1      Date2      `Test1` `Test2` `Test3`
  <chr> <dttm>     <dttm>     <chr>   <chr>   <chr>
1 Steve 2012-02-27 2011-11-18 <NA>    <NA>    3
2 Steve 2012-02-27 2012-01-22 4       <NA>    <NA>
3 Steve 2012-02-27 2014-08-09 <NA>    8       <NA>
4 Mike  2012-02-09 2007-03-29 1       2       3
5 Mike  2012-02-09 2009-07-13 <NA>    5       6
6 Mike  2012-02-09 2014-03-11 <NA>    <NA>    9
7 John  2012-03-20 2013-10-22 1       2       <NA>
8 John  2012-03-20 2014-03-17 4       5       <NA>
9 John  2012-03-20 2015-06-01 <NA>    8       9
I would like to know (most likely with dplyr) how to exclude rows whose Date2 is past Date1, then combine the remaining data into one row per Name, keeping only the most recent value of each test, and finally write a new data frame that drops the Date2 column while still keeping the NAs in the data.
Also, if none of the Date2 values for a Name are before Date1, I would like to keep the Name but fill its row with NAs (as in the case of John).
So the results should look like this:
  Name  Date1      `Test1` `Test2` `Test3`
  <chr> <dttm>     <chr>   <chr>   <chr>
1 Steve 2012-02-27 4       <NA>    3
2 Mike  2012-02-09 1       5       6
3 John  2012-03-20 <NA>    <NA>    <NA>
Any help on this would be greatly appreciated, thank you.
This will do it with dplyr...
library(dplyr)

df2 <- df %>%
  filter(as.Date(Date2) <= as.Date(Date1)) %>%        # remove Date2 past Date1
  arrange(as.Date(Date2)) %>%                         # make sure ordered by Date2
  group_by(Name, Date1) %>%                           # group by Name and Date1
  summarise_all(function(x) last(x[!is.na(x)])) %>%   # summarise the remaining (test) columns by their last non-NA value
  right_join(df %>% distinct(Name, Date1)) %>%        # join Name/Date1 from the original df (to restore NA rows such as John)
  select(-Date2)                                      # remove Date2
df2
  Name  Date1      Test1 Test2 Test3
1 Steve 2012-02-27 4     <NA>  3
2 Mike  2012-02-09 1     5     6
3 John  2012-03-20 <NA>  <NA>  <NA>
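Since summarise_all() is superseded in current dplyr, the same pipeline can be sketched with across(); the behaviour should be identical, and .groups = "drop" just avoids the regrouping message:

library(dplyr)

df %>%
  filter(as.Date(Date2) <= as.Date(Date1)) %>%
  arrange(as.Date(Date2)) %>%
  group_by(Name, Date1) %>%
  summarise(across(everything(), ~ last(.x[!is.na(.x)])), .groups = "drop") %>%
  right_join(distinct(df, Name, Date1), by = c("Name", "Date1")) %>%
  select(-Date2)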
