I have the following dataset:
my_data = data.frame(id = c(1,2,3), status_2017 = c("alive", "alive", "alive"), status_2018 = c("alive", "dead", "alive"), status_2019 = c("alive", "dead", "dead"), height_2017 = rnorm(3,3,3), height_2018 = rnorm(3,3,3),
height_2019 = rnorm(3,3,3) , weight_2017 = rnorm(3,3,3), weight_2018 = rnorm(3,3,3), weight_2019 = rnorm(3,3,3))
id status_2017 status_2018 status_2019 height_2017 height_2018 height_2019 weight_2017 weight_2018 weight_2019
1 1 alive alive alive 6.505447 7.328302 4.14945261 2.4715195 7.140026 1.843526
2 2 alive dead dead -2.033761 3.553849 0.09896499 0.4159123 4.340485 1.366350
3 3 alive alive dead 3.107110 2.967456 6.52980219 1.6573734 3.397389 3.116294
My question: if an ID is "dead" in certain years, I want to replace the corresponding height and weight values in those years with NA.
I tried to do this using index/subsets:
dead_rows <- my_data$status_2018 == "dead" | my_data$status_2019 == "dead"
my_data[dead_rows, c("height_2018", "height_2019", "weight_2018", "weight_2019")] <- NA
id status_2017 status_2018 status_2019 height_2017 height_2018 height_2019 weight_2017 weight_2018 weight_2019
1 1 alive alive alive 6.505447 7.328302 4.149453 2.4715195 7.140026 1.843526
2 2 alive dead dead -2.033761 NA NA 0.4159123 NA NA
3 3 alive alive dead 3.107110 NA NA 1.6573734 NA NA
I don't think this is correct: ID = 3 is "alive" in 2018, but their information in 2018 has been replaced with NA.
Can someone please show me how to fix this? Is there a more "compact" way of doing this that does not require writing out the names of all the columns?
Assuming the year columns appear in the same order for status, height and weight, we can build a logical matrix marking the "dead" cells, then use it as an index into the height and weight columns to set those cells to NA.
cols <- colnames(my_data)
ix <- my_data[, startsWith(cols, "status")] == "dead"
my_data[, startsWith(cols, "height")][ ix ] <- NA
my_data[, startsWith(cols, "weight")][ ix ] <- NA
my_data
# id status_2017 status_2018 status_2019 height_2017 height_2018 height_2019 weight_2017 weight_2018 weight_2019
# 1 1 alive alive alive 1.0577329 5.212449 -0.4152272 4.001597 1.248253 -1.311515
# 2 2 alive dead dead -0.4002848 NA NA 8.066520 NA NA
# 3 3 alive alive dead 5.4231554 10.285862 NA 2.619418 -1.995656 NA
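If the column order cannot be relied on, one option is to match columns by their year suffix instead. The following is only a sketch under that assumption: it expects every status_YYYY column to have height_YYYY and weight_YYYY counterparts, and the seed is arbitrary.

```r
set.seed(42)  # arbitrary seed so the illustration is reproducible
my_data <- data.frame(
  id = c(1, 2, 3),
  status_2017 = c("alive", "alive", "alive"),
  status_2018 = c("alive", "dead", "alive"),
  status_2019 = c("alive", "dead", "dead"),
  height_2017 = rnorm(3, 3, 3), height_2018 = rnorm(3, 3, 3), height_2019 = rnorm(3, 3, 3),
  weight_2017 = rnorm(3, 3, 3), weight_2018 = rnorm(3, 3, 3), weight_2019 = rnorm(3, 3, 3)
)

# Years are read off the status_* column names
years <- sub("^status_", "", grep("^status_", names(my_data), value = TRUE))

for (yr in years) {
  # rows where the individual is dead in this particular year
  dead <- my_data[[paste0("status_", yr)]] == "dead"
  # blank out only this year's height and weight for those rows
  my_data[dead, paste0(c("height_", "weight_"), yr)] <- NA
}
my_data
```

Because each year is handled separately, ID 3 keeps its 2018 values and only loses the 2019 ones.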
I have a df here (the desired output, my starting df does not have the Flag variable):
df <- data.frame(
Person = c('1','2','3'),
Date = as.Date(c('2010-09-30', '2012-11-20', '2015-03-11')),
Treatment_1 = as.Date(c('2010-09-30', '2012-11-21', '2015-03-22')),
Treatment_2 = as.Date(c('2011-09-30', 'NA', '2011-03-22')),
Treatment_3 = as.Date(c('2012-09-30', '2015-11-21', '2015-06-22')),
Surgery_1 = as.Date(c(NA, '2016-11-21', '2015-03-12')),
Surgery_2 = as.Date(c(NA, '2017-11-21', '2019-03-12')),
Surgery_3 = as.Date(c(NA, '2018-11-21', '2013-03-12')),
Flag = c('', 'Y', '')
)
and I want to derive the Flag variable based on these conditions:
For any column that starts with Treatment, set Flag to "" if Date = Treatment
For any column that starts with Surgery, set Flag to "" if Date = Surgery OR Date = Surgery +1 OR Date = Surgery - 1 (basically if the Surgery date is on the day, one day before, or one day after the Date variable, set Flag to "").
else set Flag = "Y"
I've looked into mutate_at but that rewrites the variables and assigns values of True/False.
This is wrong but this is my attempt:
df2 <- df %>%
mutate(Flag = case_when(
vars(starts_with("Treatment"), Date == . ) ~ '',
vars(starts_with("Surgery"), Date == . | Date == . - 1 | Date == . + 1) ~ '',
TRUE ~ 'Y')
)
UPDATE 2022-Aug-22
When I change a cell with the same date as the one in row 2:
df <- data.frame(
Person = c('1','2','3'),
Date = as.Date(c('2010-09-30', '2012-11-20', '2015-03-11')),
Treatment_1 = as.Date(c('2010-09-30', '2012-11-21', '2015-03-22')),
Treatment_2 = as.Date(c('2011-09-30', 'NA', '2011-03-22')),
Treatment_3 = as.Date(c('2012-09-30', '2015-11-21', '2015-06-22')),
Surgery_1 = as.Date(c(NA, '2016-11-21', '2015-03-12')),
Surgery_2 = as.Date(c(NA, '2017-11-21', '2019-03-12')),
Surgery_3 = as.Date(c(NA, '2018-11-21', '2012-11-20')),
Flag = c('', 'Y', '')
)
and then re-run the base R solution, the Flag in the second row is no longer "Y". It should still be "Y", because that row does not meet any of the conditions above.
We can use rowwise and c_across along with any for each condition in case_when. For Surgery, we build a vector of Date, Date + 1 and Date - 1 and match against it with %in%.
library(tidyverse)
df %>%
rowwise() %>%
mutate(Flag = case_when(
any(c_across(starts_with("Treatment")) == Date) ~ "",
any(c_across(starts_with("Surgery")) %in% c(Date, (Date +1), (Date-1))) ~ "",
TRUE ~ "Y"
))
Output
Person Date Treatment_1 Treatment_2 Treatment_3 Surgery_1 Surgery_2 Surgery_3 Flag
<chr> <date> <date> <date> <date> <date> <date> <date> <chr>
1 1 2010-09-30 2010-09-30 2011-09-30 2012-09-30 NA NA NA ""
2 2 2012-11-20 2012-11-21 NA 2015-11-21 2016-11-21 2017-11-21 2018-11-21 "Y"
3 3 2015-03-11 2015-03-22 2011-03-22 2015-06-22 2015-03-12 2019-03-12 2013-03-12 ""
Update
Here is a possible base R solution that is a lot quicker than the tidyverse one. This could be done in one line of code, but I decided that readability is better. First, I duplicate the Surgery columns so that we have +1 day and -1 day versions, and convert these columns to character. Then, I subset the Treatment columns and convert them to character as well. I convert to character so that all the dates can sit in one character matrix for the row-wise comparison (a matrix cannot carry the Date class). Then, I bind the date, treatment, and surgery columns together (a). Next, I compare Date against every other column row by row with apply, returning "" if there is any match and "Y" if not. Finally, I bind the result back to the original dataframe (minus Flag from your original dataframe).
dup_names <- colnames(df)[startsWith(colnames(df), "Surgery")]
surgery <-
cbind(df[dup_names], setNames(df[dup_names] + 1, paste0(dup_names, "_range1")))
surgery <-
sapply(cbind(surgery, setNames(df[dup_names] - 1, paste0(
dup_names, "_range2"
))), as.character)
treatment <-
sapply(df[startsWith(colnames(df), "Treatment")], as.character)
a <- cbind(Date = as.character(df$Date), treatment, surgery)
cbind(subset(df, select = -Flag),
Flag = ifelse(apply(a[,1]==a[,2:ncol(a)], 1, any, na.rm = TRUE), "", "Y"))
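As a cross-check on the conditions, here is a minimal sketch of a different route that compares the dates as plain day counts. It is not the method above. It uses the updated data from the question (Surgery_3 in row 3 set to 2012-11-20), leaves out the Flag column and relies only on base R.

```r
df <- data.frame(
  Person = c("1", "2", "3"),
  Date = as.Date(c("2010-09-30", "2012-11-20", "2015-03-11")),
  Treatment_1 = as.Date(c("2010-09-30", "2012-11-21", "2015-03-22")),
  Treatment_2 = as.Date(c("2011-09-30", NA, "2011-03-22")),
  Treatment_3 = as.Date(c("2012-09-30", "2015-11-21", "2015-06-22")),
  Surgery_1 = as.Date(c(NA, "2016-11-21", "2015-03-12")),
  Surgery_2 = as.Date(c(NA, "2017-11-21", "2019-03-12")),
  Surgery_3 = as.Date(c(NA, "2018-11-21", "2012-11-20"))
)

# Dates as day counts: one numeric matrix per column family
treat <- sapply(df[startsWith(names(df), "Treatment")], as.numeric)
surg  <- sapply(df[startsWith(names(df), "Surgery")], as.numeric)
d     <- as.numeric(df$Date)

# d has one value per row, so it is recycled down each column and
# every cell is compared with the Date of its own row only
hit <- rowSums(treat == d, na.rm = TRUE) > 0 |
  rowSums(abs(surg - d) <= 1, na.rm = TRUE) > 0

df$Flag <- ifelse(hit, "", "Y")
df
```

Row 2 stays "Y" here even though its Date also appears in row 3, because no comparison crosses rows.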
Here is an alternative using across approach:
library(tidyverse)
df %>%
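  # compare row by row: %in% would also match a Date that belongs to another row
  mutate(across(starts_with("Treatment"), ~as.numeric(. == Date), .names ="new_{.col}"),
         across(starts_with("Surgery"), ~as.numeric(abs(as.numeric(. - Date)) <= 1), .names ="new_{.col}")) %>%
  # one or more matches clears the flag; NA comparisons are ignored
  mutate(Flag = ifelse(rowSums(select(., contains('new')), na.rm = TRUE) >= 1, "", "Y"), .keep="used") %>%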
bind_cols(df)
Flag Person Date Treatment_1 Treatment_2 Treatment_3 Surgery_1 Surgery_2 Surgery_3
1 1 2010-09-30 2010-09-30 2011-09-30 2012-09-30 <NA> <NA> <NA>
2 Y 2 2012-11-20 2012-11-21 <NA> 2015-11-21 2016-11-21 2017-11-21 2018-11-21
3 3 2015-03-11 2015-03-22 2011-03-22 2015-06-22 2015-03-12 2019-03-12 2013-03-12
Updated to add data.table approach
If you want a data.table approach, here it is:
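library(data.table)
library(stringr)
setDT(df)  # melt() and the join need a data.table; df is assumed not to contain Flag yet
df[melt(df, id=c(1,2))[,flag:=fifelse(
  (str_starts(variable,"T") & value==Date) |
    (str_starts(variable,"S") & abs(value-Date)<=1),"", "Y")][
      , .(flag=min(flag,na.rm=TRUE)), Person], on=.(Person)]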
Output
Person Date Treatment_1 Treatment_2 Treatment_3 Surgery_1 Surgery_2 Surgery_3 flag
1: 1 2010-09-30 2010-09-30 2011-09-30 2012-09-30 <NA> <NA> <NA>
2: 2 2012-11-20 2012-11-21 <NA> 2015-11-21 2016-11-21 2017-11-21 2018-11-21 Y
3: 3 2015-03-11 2015-03-22 2011-03-22 2015-06-22 2015-03-12 2019-03-12 2013-03-12
I like Andrew's approach, but I was working on this when his answer came in, so here it is in case you are interested
df %>% inner_join(
pivot_longer(df, cols=Treatment_1:Surgery_3) %>%
mutate(flag=case_when(
(str_starts(name,"T") & value==Date) | (str_starts(name,"S") & abs(value-Date)<=1) ~ "",
TRUE ~"Y")) %>%
group_by(Person) %>%
summarize(flag = min(flag))
)
Output:
Person Date Treatment_1 Treatment_2 Treatment_3 Surgery_1 Surgery_2 Surgery_3 flag
1 1 2010-09-30 2010-09-30 2011-09-30 2012-09-30 <NA> <NA> <NA>
2 2 2012-11-20 2012-11-21 <NA> 2015-11-21 2016-11-21 2017-11-21 2018-11-21 Y
3 3 2015-03-11 2015-03-22 2011-03-22 2015-06-22 2015-03-12 2019-03-12 2013-03-12
I have a dataframe like this in R:
Start date  End date    Date 1      Date 2      Date 3      Date 4
11/12/2018  29/11/2019  08/03/2021  NA          NA          NA
07/03/2018  24/04/2019  08/03/2021  12/09/2016  NA          NA
04/06/2018  23/04/2019  08/03/2021  02/10/2017  05/10/2018  NA
26/07/2018  29/08/2019  08/03/2021  03/08/2015  02/10/2017  23/01/2017
I want to create a new column in R that returns 1 if any of Date 1, Date 2, Date 3 or Date 4 falls between Start date and End date, and 0 otherwise, as in the table below:
Start date  End date    Date 1      Date 2      Date 3      Date 4      Change
11/12/2018  29/11/2019  08/03/2021  NA          NA          NA          0
07/03/2018  24/04/2019  08/03/2021  12/09/2016  NA          NA          0
04/06/2018  23/04/2019  08/03/2021  02/10/2017  05/10/2018  NA          1
26/07/2018  29/08/2019  08/03/2021  03/08/2015  02/10/2017  23/01/2017  0
Does anyone have a suggestion on how to solve this? Thank you :)
It'll make it much easier for people to help you if you can post code / data which we can run directly. The easiest way to do this is to use a handy R function called dput, which generates instructions to exactly recreate any R object. So you might run dput(MY_DATA), or if your data is much larger than needed to demonstrate your question, dput(head(MY_DATA)) to get the first six rows, and paste the output of that into your question. </PSA>
Here's code to generate your example data:
my_data <- data.frame(
stringsAsFactors = FALSE,
Start.date = c("11/12/2018", "07/03/2018", "04/06/2018", "26/07/2018"),
End.date = c("29/11/2019", "24/04/2019", "23/04/2019", "29/08/2019"),
Date.1 = c("08/03/2021", "08/03/2021", "08/03/2021", "08/03/2021"),
Date.2 = c(NA, "12/09/2016", "02/10/2017", "03/08/2015"),
Date.3 = c(NA, NA, "05/10/2018", "02/10/2017"),
Date.4 = c(NA, NA, NA, "23/01/2017")
)
Here's a tidyverse approach to first convert your day/month/year dates into data in R's Date type using lubridate::dmy, then to compare each of Date.1 thru Date.4 against your start and end dates, and then finally to show if there are any 1's (within range).
library(dplyr); library(lubridate)
my_data %>%
  mutate(across(.cols = everything(), .fns = ~dmy(.x))) %>%
mutate(across(.cols = starts_with("Date"),
.fns = ~coalesce(.x >= Start.date & .x <= End.date, FALSE)*1)) %>%
mutate(Change = pmax(Date.1, Date.2, Date.3, Date.4))
coalesce(..., FALSE) is used here to treat NA like FALSE.
(...)*1 to convert TRUE/FALSE to 1/0.
pmax(...) to grab the largest of the 1/0's, i.e. "are there any 1's?"
Edit: alternative to leave Date columns intact:
my_data %>%
  mutate(across(.cols = everything(), .fns = ~dmy(.x))) %>%
mutate(across(.cols = starts_with("Date"),
.names = "Check_{.col}",
.fns = ~coalesce(.x >= Start.date & .x <= End.date, FALSE)*1)) %>%
rowwise() %>%
mutate(Change = max(c_across(starts_with("Check")))) %>%
select(-starts_with("Check"))
Start.date End.date Date.1 Date.2 Date.3 Date.4 Change
<date> <date> <date> <date> <date> <date> <dbl>
1 2018-12-11 2019-11-29 2021-03-08 NA NA NA 0
2 2018-03-07 2019-04-24 2021-03-08 2016-09-12 NA NA 0
3 2018-06-04 2019-04-23 2021-03-08 2017-10-02 2018-10-05 NA 1
4 2018-07-26 2019-08-29 2021-03-08 2015-08-03 2017-10-02 2017-01-23 0
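For comparison, the same check can be sketched in base R without extra packages. This is only an illustration. It reuses the my_data example from above, treats the interval as inclusive at both ends (an assumption, since the question only says "between"), and the helper names parsed, date_cols and in_range are made up for it.

```r
my_data <- data.frame(
  stringsAsFactors = FALSE,
  Start.date = c("11/12/2018", "07/03/2018", "04/06/2018", "26/07/2018"),
  End.date = c("29/11/2019", "24/04/2019", "23/04/2019", "29/08/2019"),
  Date.1 = c("08/03/2021", "08/03/2021", "08/03/2021", "08/03/2021"),
  Date.2 = c(NA, "12/09/2016", "02/10/2017", "03/08/2015"),
  Date.3 = c(NA, NA, "05/10/2018", "02/10/2017"),
  Date.4 = c(NA, NA, NA, "23/01/2017")
)

# Parse every column from day/month/year text into Date
parsed <- lapply(my_data, as.Date, format = "%d/%m/%Y")
date_cols <- grep("^Date\\.", names(my_data), value = TRUE)

# One logical column per Date.* column; a missing date counts as "not in range"
in_range <- sapply(parsed[date_cols], function(x) {
  !is.na(x) & x >= parsed$Start.date & x <= parsed$End.date
})

# 1 if any of the dates in a row falls inside the interval, else 0
my_data$Change <- as.integer(rowSums(in_range) > 0)
my_data
```

Selecting the columns by pattern means a Date.5 column would be picked up without changing the code.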
library(tidyverse)
library(lubridate)
df <- read.table(textConnection("start_date;end_date;date_1;date_2;date_3;date_4
11/12/2018;29/11/2019;08/03/2021;NA;NA;NA
07/03/2018;24/04/2019;08/03/2021;12/09/2016;NA;NA
04/06/2018;23/04/2019;08/03/2021;02/10/2017;05/10/2018;NA
26/07/2018;29/08/2019;08/03/2021;03/08/2015;02/10/2017;23/01/2017"),
sep=";",
header = TRUE)
df %>%
mutate(
across(everything(), lubridate::dmy),
change = ((date_1 > start_date & date_1 < end_date) |
(date_2 > start_date & date_2 < end_date) |
                (date_3 > start_date & date_3 < end_date) |
                (date_4 > start_date & date_4 < end_date)
) %>%
coalesce(FALSE) %>%
as.integer()
)
#> start_date end_date date_1 date_2 date_3 date_4 change
#> 1 2018-12-11 2019-11-29 2021-03-08 <NA> <NA> <NA> 0
#> 2 2018-03-07 2019-04-24 2021-03-08 2016-09-12 <NA> <NA> 0
#> 3 2018-06-04 2019-04-23 2021-03-08 2017-10-02 2018-10-05 <NA> 1
#> 4 2018-07-26 2019-08-29 2021-03-08 2015-08-03 2017-10-02 2017-01-23 0
Using the dataset below:
df <- structure(list(test = c("1st", "2nd", "3rd", "1st", "2nd", "3rd", "1st", "2nd", "3rd"),
id = c("PID1", "PID1", "PID1", "PID2", "PID2", "PID2", "PID3", "PID3", "PID3"),
date = c("2020-01-01", "2020-01-13", "2020-01-17", "2020-01-01", "2020-01-13", "2020-01-20", "2020-01-01", "2020-01-14", "2020-01-18"),
status_1 = c("Symp", "Symp", "uninfected", "Asymp", "Symp", "uninfected", "Asymp", "Asymp", "uninfected"),
status_2 = c("Symp", "Symp", "uninfected", "pre-Symp", "Symp", "uninfected", "Asymp", "Asymp", "uninfected")),
class = "data.frame",
row.names = c(NA, -9L)
)
I'd like to identify individuals whose status_1 is Asymp at the first test and check whether they become Symp within 14 days.
Such individuals should have their status_2 changed to pre-Symp; otherwise, their status should remain the same, e.g. the individual with id PID3.
This individual is Asymp after the first test and does not transition to Symp within 14 days, so his status remains Asymp.
Here's the code I tried to put together but I keep running into errors:
df <- df %>%
mutate(status_2 = case_when(test == "1st" &
status_1 == "Asymp" &
status_1[date + 14] != "Symp" ~ "pre-Symp",
TRUE ~ status_1))
You can create your own custom function which checks for the status within that time interval.
library(dplyr)
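check_status <- function(x, date) {
  # x: statuses of one individual in test order; date: the matching test dates
  if(first(x) == 'Asymp' && any(x == 'Symp')) {
    # which.max() on a logical vector returns the position of the first TRUE
    if (date[which.max(x == 'Symp')] - first(date) < 14)
      x[1] <- 'pre-Symp'
  }
  return(x)
}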
Now apply this function by group :
df %>%
mutate(date = as.Date(date)) %>%
group_by(id) %>%
mutate(status_2 = check_status(status_1, date))
# test id date status_1 status_2
# <chr> <chr> <date> <chr> <chr>
#1 1st PID1 2020-01-01 Symp Symp
#2 2nd PID1 2020-01-13 Symp Symp
#3 3rd PID1 2020-01-17 uninfected uninfected
#4 1st PID2 2020-01-01 Asymp pre-Symp
#5 2nd PID2 2020-01-13 Symp Symp
#6 3rd PID2 2020-01-20 uninfected uninfected
#7 1st PID3 2020-01-01 Asymp Asymp
#8 2nd PID3 2020-01-14 Asymp Asymp
#9 3rd PID3 2020-01-18 uninfected uninfected
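The same rule can also be written without dplyr. This is a rough sketch that assumes the rows are already sorted by date within each id; flag_pre_symp and its window argument are hypothetical names, and the strict `<` mirrors the function above.

```r
df <- data.frame(
  test = rep(c("1st", "2nd", "3rd"), 3),
  id = rep(c("PID1", "PID2", "PID3"), each = 3),
  date = c("2020-01-01", "2020-01-13", "2020-01-17",
           "2020-01-01", "2020-01-13", "2020-01-20",
           "2020-01-01", "2020-01-14", "2020-01-18"),
  status_1 = c("Symp", "Symp", "uninfected", "Asymp", "Symp", "uninfected",
               "Asymp", "Asymp", "uninfected")
)

# Hypothetical helper: relabel the first test as pre-Symp when the individual
# starts Asymp and shows Symp fewer than `window` days after the first test
flag_pre_symp <- function(status, date, window = 14) {
  days <- as.numeric(date - date[1])  # days elapsed since the first test
  if (status[1] == "Asymp" && any(status == "Symp" & days < window)) {
    status[1] <- "pre-Symp"
  }
  status
}

# Apply per individual, then put the pieces back in the original row order
df$status_2 <- unsplit(
  Map(flag_pre_symp,
      split(df$status_1, df$id),
      split(as.Date(df$date), df$id)),
  df$id
)
df
```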
Let's say I have three variables id, date, trad (which has 3 values and can be any one of them at any time point):
library(tidyverse)
dput(df)
structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 2, 2, 2), date = structure(c(16436, 16437, 16438, 16439,
16440, 16441, 16442, 16443, 16444, 16445, 16446, 16447, 16448,
16449, 16450, 16451, 16452, 16453, 16454), class = "Date"), trad = c("Free",
"Suspended", "Suspended", "Free", "Suspended", "Withdrawn", "Withdrawn",
"Free", "Withdrawn", "Free", "Free", "Withdrawn", "Suspended",
"Withdrawn", "Withdrawn", "Free", "Withdrawn", "Suspended", "Free"
)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA,
-19L), spec = structure(list(cols = list(id = structure(list(), class = c("collector_double",
"collector")), date = structure(list(format = "%d/%m/%Y"), class = c("collector_date",
"collector")), trad = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
df
# A tibble: 19 x 3
id date trad
<dbl> <date> <chr>
1 1 2015-01-01 Free
2 1 2015-01-02 Suspended
3 1 2015-01-03 Suspended
4 1 2015-01-04 Free
5 1 2015-01-05 Suspended
6 1 2015-01-06 Withdrawn
7 1 2015-01-07 Withdrawn
8 1 2015-01-08 Free
9 1 2015-01-09 Withdrawn
10 1 2015-01-10 Free
11 1 2015-01-11 Free
12 1 2015-01-12 Withdrawn
13 1 2015-01-13 Suspended
14 1 2015-01-14 Withdrawn
15 1 2015-01-15 Withdrawn
16 1 2015-01-16 Free
17 2 2015-01-17 Withdrawn
18 2 2015-01-18 Suspended
19 2 2015-01-19 Free
I would like to generate new columns with the start and end dates of each period. A period starts when trad moves to status "Withdrawn", with the caveat that if there is a "Suspended" row immediately before the "Withdrawn" row, the start date moves to that row. If there are multiple rows of "Suspended" before "Withdrawn", the period starts at the first "Suspended". Similarly, the end date is when trad goes to "Free" after being in "Withdrawn". This is the required final dataset:
dfnew
# A tibble: 19 x 6
id date trad start end period
<dbl> <date> <chr> <date> <date> <dbl>
1 1 2015-01-01 Free NA NA NA
2 1 2015-01-02 Suspended NA NA NA
3 1 2015-01-03 Suspended NA NA NA
4 1 2015-01-04 Free NA NA NA
5 1 2015-01-05 Suspended 2015-01-05 NA 1
6 1 2015-01-06 Withdrawn NA NA 1
7 1 2015-01-07 Withdrawn NA NA 1
8 1 2015-01-08 Free NA 2015-01-08 1
9 1 2015-01-09 Withdrawn 2015-01-09 NA 2
10 1 2015-01-10 Free NA 2015-01-10 2
11 1 2015-01-11 Free NA NA NA
12 1 2015-01-12 Withdrawn 2015-01-12 NA 3
13 1 2015-01-13 Suspended NA NA 3
14 1 2015-01-14 Withdrawn NA NA 3
15 1 2015-01-15 Withdrawn NA NA 3
16 1 2015-01-16 Free NA 2015-01-16 NA
17 2 2015-01-17 Withdrawn 2015-01-17 NA 1
18 2 2015-01-18 Suspended NA NA 1
19 2 2015-01-19 Free NA 2015-01-19 1
There is no pattern in trad so you could have any sequence of "Withdrawn"/"Suspended" before "Free" so a solution something like this doesn't work (in theory it could, but I would need too many conditions to implement it):
dfnew <- df %>%
  group_by(id) %>%
mutate(start = ifelse(trad == "Withdrawn" & lag(trad == "Free"), date, NA))
These questions are helpful but don't answer the question:
How to extract the previous n rows where a certain column value cannot be a particular value?
R - Conditional lagging - How to lag a certain amount of cells until a condition is met?
Would anyone have a flexible solution?
Not very flexible, but at least a try.
I don't know what should happen when we have the sequence Suspended, Suspended, Withdrawn, Withdrawn.
For example, change trad on 2015-01-04 to Suspended. What is the start date in this case?
I give two solutions: the first puts the start date on 2015-01-02 and the second on 2015-01-05.
dfnew1 <- df %>%
  mutate(startGroups = cumsum(trad == "Free")) %>%
  group_by(startGroups) %>% # make a group from every occurrence of "Free" in trad
  mutate(wds = cumsum(trad == "Withdrawn"),
         # if there is any "Withdrawn" in the group, set the start date on the row right after "Free";
         # if_else() keeps the Date class, which ifelse() would drop
         start = if_else(max(wds) > 0 & row_number() == 2, date, as.Date(NA))
  ) %>%
  ungroup() %>%
  mutate(endGroups = cumsum(!is.na(start))) %>%
  group_by(endGroups) %>% # group on every open period now
  mutate(frees = cumsum(trad == "Free"),
         end = if_else(frees == 1 & endGroups > 0, date, as.Date(NA)) # end on the first occurrence of "Free" in trad
  ) # %>% select(-c(startGroups, wds, endGroups, frees)) # remove helper columns
dfnew2 <- df %>%
  mutate(startGroups = cumsum(trad == "Free")) %>%
  group_by(startGroups) %>% # make a group from every occurrence of "Free" in trad
  mutate(wds = cumsum(trad == "Withdrawn"),
         # start on the "Suspended" row right before the first "Withdrawn",
         # or on the first "Withdrawn" itself when no "Suspended" precedes it
         start = if_else(
           (trad == "Suspended" & lead(trad) == "Withdrawn" & lead(wds) == 1 |
              trad == "Withdrawn" & lag(trad) != "Suspended" & wds == 1),
           date, as.Date(NA))
  ) %>%
  ungroup() %>%
  mutate(endGroups = cumsum(!is.na(start))) %>%
  group_by(endGroups) %>%
  mutate(frees = cumsum(trad == "Free"),
         end = if_else(frees == 1 & endGroups > 0, date, as.Date(NA))
  ) # %>% select(-c(startGroups, wds, endGroups, frees))
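Another way to read the rules in the question: within one id, every unbroken run of non-"Free" rows that contains at least one "Withdrawn" is a period. Everything before the first "Withdrawn" in such a run is necessarily "Suspended", so the period starts on the first row of the run and ends on the "Free" row that follows it. The sketch below implements that reading with a plain loop. label_periods is a hypothetical helper, and the closing "Free" row is given the period number, as in rows 8 and 10 of the desired output.

```r
df <- data.frame(
  id = rep(c(1, 2), times = c(16, 3)),
  date = as.Date("2015-01-01") + 0:18,
  trad = c("Free", "Suspended", "Suspended", "Free", "Suspended", "Withdrawn",
           "Withdrawn", "Free", "Withdrawn", "Free", "Free", "Withdrawn",
           "Suspended", "Withdrawn", "Withdrawn", "Free", "Withdrawn",
           "Suspended", "Free")
)

# Hypothetical helper working on the rows of a single id
label_periods <- function(trad, date) {
  n <- length(trad)
  start <- end <- rep(as.Date(NA), n)
  period <- rep(NA_integer_, n)
  k <- 0L
  i <- 1L
  while (i <= n) {
    if (trad[i] == "Free") { i <- i + 1L; next }
    j <- i
    while (j < n && trad[j + 1L] != "Free") j <- j + 1L  # rows i..j form one non-Free run
    if (any(trad[i:j] == "Withdrawn")) {
      k <- k + 1L
      start[i] <- date[i]
      last <- j
      if (j < n) {               # a "Free" row follows and closes the period
        end[j + 1L] <- date[j + 1L]
        last <- j + 1L
      }
      period[i:last] <- k
    }
    i <- j + 1L
  }
  data.frame(start, end, period)
}

# Apply per id and stack the pieces back together
dfnew <- do.call(rbind, lapply(split(df, df$id), function(g) {
  cbind(g, label_periods(g$trad, g$date))
}))
rownames(dfnew) <- NULL
dfnew
```

With the sample data this starts periods on rows 5, 9, 12 and 17 and closes them on rows 8, 10, 16 and 19. Runs made only of "Suspended" rows, such as rows 2 and 3, are skipped.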
My objective is to write more efficient code that creates a calendar distributing each dealer's customers across different days so that:
All days have at least min customers per day (min depends on the dealer)
All days have at most max customers per day (max depends on the dealer)
Whenever a customer is moved to a different day, he should be moved to the closest one possible (don't bother about ties: if the distance is the same, then 1 day before or 1 day after is equally good)
Example:
dealer_id <- rep(c("ABC123","DEF234","GHJ456"), each = 4)
date <- as.Date(rep(c("2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04"), times = 3))
cust_pos_1 <- c("MCA1", "MCA2", "MCA3", "MCA4", "MCA5", "MCA6", NA, NA, "MCA9", "MCA10", "MCA11", "MCA12")
cust_pos_2 <- c("MCB1", "MCB2", NA, NA, "MCB5", NA, NA, NA, "MCB9", NA, "MCB11", NA)
cust_pos_3 <- c("MCC1", "MCC2", NA, NA, NA, NA, NA, NA, "MCC9", NA, NA, NA)
df <- data.frame(dealer_id, date, cust_pos_1, cust_pos_2, cust_pos_3)
settings <- data.frame(dealer_id = c("ABC123","DEF234","GHJ456"), min_daily = c(2, 0, 1), max_daily = c(3, 1, 2))
Gives us the input data and the dealer settings:
dealer_id date cust_pos_1 cust_pos_2 cust_pos_3
ABC123 2018-01-01 MCA1 MCB1 MCC1
ABC123 2018-01-02 MCA2 MCB2 MCC2
ABC123 2018-01-03 MCA3 NA NA
ABC123 2018-01-04 MCA4 NA NA
DEF234 2018-01-01 MCA5 MCB5 NA
DEF234 2018-01-02 MCA6 NA NA
DEF234 2018-01-03 NA NA NA
DEF234 2018-01-04 NA NA NA
GHJ456 2018-01-01 MCA9 MCB9 MCC9
GHJ456 2018-01-02 MCA10 NA NA
GHJ456 2018-01-03 MCA11 MCB11 NA
GHJ456 2018-01-04 MCA12 NA NA
dealer_id min_daily max_daily
ABC123 2 3
DEF234 0 1
GHJ456 1 2
and the output data after the code runs should look like this:
dealer_id date cust_pos_1 cust_pos_2 cust_pos_3
ABC123 2018-01-01 MCA1 MCB1 NA
ABC123 2018-01-02 MCA2 MCB2 NA
ABC123 2018-01-03 MCA3 MCC1 NA
ABC123 2018-01-04 MCA4 MCC2 NA
DEF234 2018-01-01 MCA5 NA NA
DEF234 2018-01-02 MCA6 NA NA
DEF234 2018-01-03 MCB5 NA NA
DEF234 2018-01-04 NA NA NA
GHJ456 2018-01-01 MCA9 MCB9 NA
GHJ456 2018-01-02 MCA10 MCC9 NA
GHJ456 2018-01-03 MCA11 MCB11 NA
GHJ456 2018-01-04 MCA12 NA NA
Because of the settings - the customers had to be re-distributed according to the above rules.
There is a rule for the settings table as well! The difference between min and max is always going to be 1.
Clearly there is more than one way to solve this: since we don't care whether a customer is moved x days before or x days after, we can get different (and better!) solutions.
Now. That being said, I've solved it using a loop that takes ages to run (my data frames are massive - I've got to create this calendar for 5 years and for 150 dealers with hundreds of customers).
My question is: Is there a way to do it using maybe dplyr or data.table or something else so that it runs faster?
Thanks.
The code below fills the empty cells within the acceptable range using the customers that fall outside the max_daily setting. However, one potential problem is that the code may relocate customers far away from their original date.
The code provided is compatible with dplyr pipe operator and is really quick.
calendar_sort <- function(df){
df <- as.data.frame(df)
#
nr <- nrow(df)
nc <- ncol(df)
nc_p <- nc - 2 #if you have more than 2 col for DealerID and Date, change this
ttl_cn <- sum(!is.na(df[,3:nc])) # total count of non_NAs
mx <- ceiling(ttl_cn / nr)
rng_nnac <- sum(!is.na(df[,3:(2+mx)])) # count of non-NAs within the range
left_overs <- df[,(mx+3):nc]
left_overs <- left_overs[!is.na(left_overs)]
lo_c <- length(left_overs) # leftover non-NA count
ttl_s <- (nr*nc_p) # total number of cells (total space)
df[,(mx+3):nc] <- NA
A <- c(left_overs, rep(NA, ttl_s - rng_nnac -lo_c))
df[is.na(df)] <- A
B <- df[,3:nc]
B <- B[!is.na(B)]
B <- c(B, rep(NA, ttl_s - length(B)))
df[,3:nc] <- B
return(df)
}
library(dplyr)
# the grouping column in the example data is dealer_id
result <- df %>% dplyr::group_by(dealer_id) %>% do(calendar_sort(.))
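Since calendar_sort() does not read the settings table, it may be worth checking any resulting calendar against it. Here is a small sketch for that: check_calendar is a hypothetical helper, shown on the input data from the question, and it assumes the customer columns all start with cust_pos_.

```r
dealer_id <- rep(c("ABC123", "DEF234", "GHJ456"), each = 4)
date <- as.Date(rep(c("2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04"), times = 3))
cust_pos_1 <- c("MCA1", "MCA2", "MCA3", "MCA4", "MCA5", "MCA6", NA, NA, "MCA9", "MCA10", "MCA11", "MCA12")
cust_pos_2 <- c("MCB1", "MCB2", NA, NA, "MCB5", NA, NA, NA, "MCB9", NA, "MCB11", NA)
cust_pos_3 <- c("MCC1", "MCC2", NA, NA, NA, NA, NA, NA, "MCC9", NA, NA, NA)
df <- data.frame(dealer_id, date, cust_pos_1, cust_pos_2, cust_pos_3)
settings <- data.frame(dealer_id = c("ABC123", "DEF234", "GHJ456"),
                       min_daily = c(2, 0, 1), max_daily = c(3, 1, 2))

# Hypothetical helper: count customers per dealer and day, then compare with the limits
check_calendar <- function(cal, settings) {
  cust_cols <- grep("^cust_pos_", names(cal), value = TRUE)
  cal$n <- rowSums(!is.na(cal[cust_cols]))          # customers scheduled that day
  out <- merge(cal[c("dealer_id", "date", "n")], settings, by = "dealer_id")
  out$ok <- out$n >= out$min_daily & out$n <= out$max_daily
  out
}

res <- check_calendar(df, settings)
res[!res$ok, ]  # days that still violate the dealer's min/max
```

Running the same check on the output of a redistribution step shows whether every day ended up within the limits.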