Rank most recent scores of students within a given date - 30 days window - r

Following is what my dataframe/data.table looks like. The rank column is my desired calculated field.
library(data.table)
df <- fread('
Name Score Date Rank
John 42 1/1/2018 3
Rob 85 12/31/2017 2
Rob 89 12/26/2017 1
Rob 57 12/24/2017 1
Rob 53 08/31/2017 1
Rob 72 05/31/2017 2
Kate 87 12/25/2017 1
Kate 73 05/15/2017 1
')
df[,Date:= as.Date(Date, format="%m/%d/%Y")]
I am trying to calculate the rank of each student at every given point in time in the data within a 30-day window. For that, I need to fetch the most recent scores of all students at a given point in time and then apply the rank function.
In the 1st row, as of 1/1/2018, John has two competitors in the past 30-day window: Rob, with a most recent score of 85 on 12/31/2017, and Kate, with a most recent score of 87 on 12/25/2017. Both of these dates fall within the 30-day window ending 1/1/2018. John gets a rank of 3 with the lowest score of 42. If only one student falls within the 30-day window ending on the date of a given row, then the rank is 1.
In the 3rd row the date is 12/26/2017, so Rob's score as of 12/26/2017 is 89. Only one other student falls in the 30-day window ending 12/26/2017: Kate, whose most recent score (87) is from 12/25/2017. Within that window, Rob's score of 89 is higher than Kate's score of 87, and therefore Rob gets rank 1.
I was thinking about using the framework from here Efficient way to perform running total in the last 365 day window but struggling to think of a way to fetch all recent score of all students at a given point in time before using rank.

This seems to work:
ranks = df[.(d_dn = Date - 30L, d_up = Date), on=.(Date >= d_dn, Date <= d_up), allow.cart=TRUE][,
.(LatestScore = last(Score)), by=.(Date = Date.1, Name)]
setorder(ranks, Date, -LatestScore)
ranks[, r := rowid(Date)]
df[ranks, on=.(Name, Date), r := i.r]
Name Score Date Rank r
1: John 42 2018-01-01 3 3
2: Rob 85 2017-12-31 2 2
3: Rob 89 2017-12-26 1 1
4: Rob 57 2017-12-24 1 1
5: Rob 53 2017-08-31 1 1
6: Rob 72 2017-05-31 2 2
7: Kate 87 2017-12-25 1 1
8: Kate 73 2017-05-15 1 1
... using last since the Cartesian join seems to sort and we want the latest measurement.
How the update join works
The i. prefix means it's a column from i in the x[i, ...] join, and the assignment := is always in x. So it's looking up each row of i in x and where matches are found, copying values from i to x.
Another way that is sometimes useful is to look up x rows in i, something like df[, r := ranks[df, on=.(Name,Date), x.r]] in which case x.r is still from the ranks table (now in the x position relative to the join).
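As a minimal sketch of the update join described above (the tables x and i and their values are made up for illustration, they are not the question's data), the i. prefix copies a column from the lookup table into the table being updated:

```r
library(data.table)

# Hypothetical tables: x is updated in place, i supplies the values
x <- data.table(Name = c("John", "Rob", "Kate"), Score = c(42, 85, 87))
i <- data.table(Name = c("Rob", "Kate"), r = c(2L, 1L))

# For each row of i, find matching rows in x and assign i's r column to x
x[i, on = .(Name), r := i.r]

# Rows of x without a match in i (John here) are left as NA
x
```

Because := assigns by reference, x is modified without being copied.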
There's also...
ranks = df[CJ(Name = Name, Date = Date, unique=TRUE), on=.(Name, Date), roll=30, nomatch=0]
setnames(ranks, "Score", "LatestScore")
# and then use the same last three lines above
I'm not sure about efficiency of one vs another, but I guess it depends on number of Names, frequency of measurement and how often measurement days coincide.
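To check either approach on small data, a brute-force base R version of the rule stated in the question can help. This is only a sketch: the helper name rank_at_row is made up, it assumes a student never has two scores on the same date, and it is quadratic in the number of rows, so it is meant for verification rather than for large tables.

```r
df <- data.frame(
  Name  = c("John", "Rob", "Rob", "Rob", "Rob", "Rob", "Kate", "Kate"),
  Score = c(42, 85, 89, 57, 53, 72, 87, 73),
  Date  = as.Date(c("2018-01-01", "2017-12-31", "2017-12-26", "2017-12-24",
                    "2017-08-31", "2017-05-31", "2017-12-25", "2017-05-15"))
)

# Hypothetical helper: rank of row i's student among the latest scores
# of all students seen in the 30 days up to and including row i's date
rank_at_row <- function(i, d) {
  in_window <- d$Date >= d$Date[i] - 30 & d$Date <= d$Date[i]
  w <- d[in_window, ]
  w <- w[order(w$Date, decreasing = TRUE), ]   # newest first
  latest <- w[!duplicated(w$Name), ]           # keep newest row per student
  ranks <- rank(-latest$Score, ties.method = "min")
  ranks[latest$Name == d$Name[i]]
}

df$check <- vapply(seq_len(nrow(df)), rank_at_row, numeric(1), d = df)
df$check
# 3 2 1 1 1 2 1 1
```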

A solution that uses data.table though not sure if it is the most efficient usage:
df[.(iName=Name, iScore=Score, iDate=Date, StartDate=Date-30, EndDate=Date),
.(Rank=frank(-c(iScore[1L], .SD[Name != iName, max(Score), by=.(Name)]$V1),
ties.method="first")[1L]),
by=.EACHI,
on=.(Date >= StartDate, Date <= EndDate)]
Explanation:
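1) The outer square brackets do a non-equi join within a date range (i.e. from 30 days ago to the date of each row). Try studying the output below against the input data. OrigDate keeps a copy of the original date, because the join reports the join bounds in the Date column:
df[, OrigDate := Date]
df[.(iName=Name, iScore=Score, iDate=Date, StartDate=Date-30, EndDate=Date),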
c(.(RowGroup=.GRP),
.SD[, .(Name, Score, Date, OrigDate, iName, iScore, iDate, StartDate, EndDate)]),
by=.EACHI,
on=.(Date >= StartDate, Date <= EndDate)]
2) .EACHI is to perform j calculations for each row of i.
3) Inside j, iScore[1L] is the score for the current row, and .SD[Name != iName] takes the scores that do not belong to the student in the current row. Then we take max(Score) for each of those students within the 30-day window.
4) Concatenate all these scores and calculate the rank for the score of the current row while taking care of ties by taking the first tie.
Note:
see ?data.table to understand what i, j, by, on and .EACHI refers to.
EDIT after comments by OP:
I would add a OrigDate column and find those that matches the latest date.
df[, OrigDate := Date]
df[.(iName=Name, iScore=Score, iDate=Date, StartDate=Date-30, EndDate=Date),
.(Name=iName, Score=iScore, Date=iDate,
Rank=frank(-c(iScore[1L],
.SD[Name != iName, Score[OrigDate==max(OrigDate)], by=.(Name)]$V1),
ties.method="first")[1L]),
by=.EACHI,
on=.(Date >= StartDate, Date <= EndDate)]

I came up with the following partial solution, but ran into a problem: is it possible for two people to occur with the same date?
If not, have a look at the following piece of code:
library(tidyverse) # easy manipulation
library(lubridate) # time handling
# This function can be added to
get_top <- function(df, date_sel) {
temp <- df %>%
filter(Date > date_sel - months(1), Date <= date_sel) %>% # look one month into the past from the given date, ignoring later scores
group_by(Name) %>% # and for each occurring name
summarise(max_score = max(Score)) %>% # find the maximal score
arrange(desc(max_score)) %>% # sort them
mutate(Rank = 1:n()) # and rank them
temp
}
Now, you have to find the name in the table, for given date and return its rank.

library(data.table)
library(magrittr)
setorder(df, -Date)
fun <- function(i){
df[i:nrow(df), head(.SD, 1), by = Name] %$%
rank(-Score[Date > df$Date[i] - 30])[1]
}
df[, rank := sapply(1:.N, fun)]

This can be done by joining to df those rows of df that are within 30 days behind it or the same date and have higher or equal scores. Then for each original row and joined row Name get the joined row Name that is the most recent. The count of the remaining joined rows for each of the original df rows is the rank.
library(sqldf)
sqldf("with X as
(select a.rowid r, a.*, max(b.Date) Date
from df a join df b
on b.Date between a.Date - 30 and a.Date and b.Score >= a.Score
group by a.rowid, b.Name)
select Name, Date, Score, count(*) Rank
from X
group by r
order by r")
giving:
Name Date Score Rank
1 John 2018-01-01 42 3
2 Rob 2017-12-31 85 2
3 Rob 2017-12-26 89 1
4 Rob 2017-12-24 57 1
5 Rob 2017-08-31 53 1
6 Rob 2017-05-31 72 2
7 Kate 2017-12-25 87 1
8 Kate 2017-05-15 73 1

A tidyverse solution (dplyr + tidyr):
df %>%
complete(Name,Date) %>%
group_by(Name) %>%
mutate(last_score_date = `is.na<-`(Date,is.na(Score))) %>%
fill(Score,last_score_date) %>%
filter(!is.na(Score) & Date-last_score_date <30) %>%
group_by(Date) %>%
mutate(Rank = rank(-Score)) %>%
right_join(df)
# # A tibble: 8 x 5
# # Groups: Date [?]
# Name Date Score last_score_date Rank
# <chr> <date> <int> <date> <dbl>
# 1 John 2018-01-01 42 2018-01-01 3
# 2 Rob 2017-12-31 85 2017-12-31 2
# 3 Rob 2017-12-26 89 2017-12-26 1
# 4 Rob 2017-12-24 57 2017-12-24 1
# 5 Rob 2017-08-31 53 2017-08-31 1
# 6 Rob 2017-05-31 72 2017-05-31 2
# 7 Kate 2017-12-25 87 2017-12-25 1
# 8 Kate 2017-05-15 73 2017-05-15 1
We add all missing combinations of Date and Name
then we create a column for the last_score_date, equal to Date when score isn't NA.
by filling NAs down Score has become the latest score
we filter out NAs and keep only scores that have < 30 days of age
That's our table of valid scores by dates
From there it's easy to add ranks
and a final right_join on the original table gives us the expected output
data
library(data.table)
df <- fread('
Name Score Date
John 42 01/01/2018
Rob 85 12/31/2017
Rob 89 12/26/2017
Rob 57 12/24/2017
Rob 53 08/31/2017
Rob 72 05/31/2017
Kate 87 12/25/2017
Kate 73 05/15/2017
')
df[,Date:= as.Date(Date, format="%m/%d/%Y")]

Related

Aggregate week and date in R by some specific rules

I'm not used to using R. I already asked a question on stack overflow and got a great answer.
I'm sorry to post a similar question, but I tried many times and got the output that I didn't expect.
This time, I want to do slightly different from my previous question.
Merge two data with respect to date and week using R
I have two data. One has a year_month_week column and the other has a date column.
df1<-data.frame(id=c(1,1,1,2,2,2,2),
year_month_week=c(2022051,2022052,2022053,2022041,2022042,2022043,2022044),
points=c(65,58,47,21,25,27,43))
df2<-data.frame(id=c(1,1,1,2,2,2),
date=c(20220503,20220506,20220512,20220401,20220408,20220409),
temperature=c(36.1,36.3,36.6,34.3,34.9,35.3))
For df1, 2022051 means the 1st week of May 2022. Likewise, 2022052 means the 2nd week of May 2022. For df2, 20220503 means May 3rd, 2022. What I want to do now is merge df1 and df2 with respect to year_month_week. In this case, 20220503 and 20220506 are both in the 1st week of May 2022. If more than one date falls in a year_month_week, I will just include the first of them. Now, here's the different part: even if there is no date inside a year_month_week, just leave it as NA. So my expected output has the same number of rows as df1, which includes the column year_month_week. My expected output is as follows:
df<-data.frame(id=c(1,1,1,2,2,2,2),
year_month_week=c(2022051,2022052,2022053,2022041,2022042,2022043,2022044),
points=c(65,58,47,21,25,27,43),
temperature=c(36.1,36.6,NA,34.3,34.9,NA,NA))
First we can convert the dates in df2 into the year-month-week format, then join the two tables:
library(dplyr);library(lubridate)
df2$dt = ymd(df2$date)
df2$wk = (day(df2$dt) - 1) %/% 7 + 1  # days 1-7 are week 1, days 8-14 are week 2, ...
df2$year_month_week = as.numeric(paste0(format(df2$dt, "%Y%m"), df2$wk))
df1 %>%
left_join(df2 %>% group_by(year_month_week) %>% slice(1) %>%
select(year_month_week, temperature))
Result
Joining, by = "year_month_week"
id year_month_week points temperature
1 1 2022051 65 36.1
2 1 2022052 58 36.6
3 1 2022053 47 NA
4 2 2022041 21 34.3
5 2 2022042 25 34.9
6 2 2022043 27 NA
7 2 2022044 43 NA
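The join key can be sanity-checked in base R before joining. The sketch below assumes that week 1 means days 1 to 7 of the month, week 2 means days 8 to 14, and so on; the helper names week_of_month and make_key are made up for illustration.

```r
# Hypothetical helpers: week of the month, assuming week 1 = days 1-7
week_of_month <- function(d) (as.integer(format(d, "%d")) - 1L) %/% 7L + 1L

# Build the numeric year_month_week key, e.g. 2022051 for 2022-05-03
make_key <- function(d) as.numeric(paste0(format(d, "%Y%m"), week_of_month(d)))

dates <- as.Date(c("2022-05-03", "2022-05-06", "2022-05-07",
                   "2022-05-12", "2022-11-08"))
make_key(dates)
# 2022051 2022051 2022051 2022052 2022112
```

Using format(d, "%Y%m") keeps the month zero-padded, so keys for October to December have the same width as the others.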
You can build on a previous answer by taking its function for counting the week of the month and using it to generate a join key in df2.
df1 <- data.frame(
id=c(1,1,1,2,2,2,2),
year_month_week=c(2022051,2022052,2022053,2022041,2022042,2022043,2022044),
points=c(65,58,47,21,25,27,43))
df2 <- data.frame(
id=c(1,1,1,2,2,2),
date=c(20220503,20220506,20220512,20220401,20220408,20220409),
temperature=c(36.1,36.3,36.6,34.3,34.9,35.3))
# Take the function from the previous StackOverflow question
monthweeks.Date <- function(x) {
ceiling(as.numeric(format(x, "%d")) / 7)
}
# Create a year_month_week variable to join on
df2 <-
df2 %>%
mutate(
date = lubridate::parse_date_time(
x = date,
orders = "%Y%m%d"),
year_month_week = paste0(
format(date, "%Y%m"), # zero-padded month, so October to December also work
monthweeks.Date(date)),
year_month_week = as.double(year_month_week))
# Remove duplicate year_month_weeks
df2 <-
df2 %>%
arrange(year_month_week) %>%
distinct(year_month_week, .keep_all = TRUE)
# Join dataframes
df1 <-
left_join(
df1,
df2,
by = "year_month_week")
Produces this result
id.x year_month_week points id.y date temperature
1 1 2022051 65 1 2022-05-03 36.1
2 1 2022052 58 1 2022-05-12 36.6
3 1 2022053 47 NA <NA> NA
4 2 2022041 21 2 2022-04-01 34.3
5 2 2022042 25 2 2022-04-08 34.9
6 2 2022043 27 NA <NA> NA
7 2 2022044 43 NA <NA> NA
Edit: forgot to mention that you need tidyverse loaded
library(tidyverse)

subset data based on condition in r [duplicate]

This question already has answers here:
Remove group from data.frame if at least one group member meets condition
(4 answers)
Closed 1 year ago.
I want to select those households where all the members' ages are greater than 20 in R.
household Members_age
100 75
100 74
100 30
101 20
101 50
101 60
102 35
102 40
102 5
Here two household satisfy the condition. Household 100 and 101.
How to do it in r?
what I did is following but it's not working.
sqldf("select household,Members_age from data group by household having Members_age > 20")
household Members_age
100 75
102 35
Please suggest. Here is the sample dataset
library(dplyr)
library(sqldf)
data <- data.frame(household = c(100,100,100,101,101,101,102,102,102),
Members_age = c(75,74,30,20,50,60,35,40,5))
You can use ave.
data[ave(data$Members_age, data$household, FUN=min) > 20,]
# household Members_age
#1 100 75
#2 100 74
#3 100 30
or only the households.
unique(data$household[ave(data$Members_age, data$household, FUN=min) > 20])
#[1] 100
I understand the intent behind your HAVING clause, but your request ("all members' ages greater than 20") does not match your sqldf output. This is because a bare column in HAVING is evaluated against a single row per household (here the first), which is why we see 102 (and shouldn't), and why 101 is dropped only because its first row happens to be 20.
To implement your logic, I suggest changing your sqldf code to the following:
sqldf("select household,Members_age from data group by household having min(Members_age) > 20")
# household Members_age
# 1 100 30
which is effectively the SQL analog of GKi's ave answer.
An alternative:
library(dplyr)
data %>%
group_by(household) %>%
filter(all(Members_age > 20)) %>%
ungroup()
# # A tibble: 3 x 2
# household Members_age
# <dbl> <dbl>
# 1 100 75
# 2 100 74
# 3 100 30
and if you just need one row per household, then add %>% distinct(household) or perhaps %>% distinct(household, .keep_all = TRUE).
But for base R, I think nothing is likely to be better than GKi's use of ave.
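The question expects households 100 and 101, while every answer returns only 100. The difference comes from household 101 having a member who is exactly 20. A small base R sketch (the variable name min_age is made up) shows both readings of the condition:

```r
data <- data.frame(household   = c(100, 100, 100, 101, 101, 101, 102, 102, 102),
                   Members_age = c(75, 74, 30, 20, 50, 60, 35, 40, 5))

# Youngest member per household: all ages pass the test if the minimum does
min_age <- tapply(data$Members_age, data$household, min)

names(min_age)[min_age > 20]    # strictly greater than 20
# "100"
names(min_age)[min_age >= 20]   # 20 or older
# "100" "101"
```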

How to diagonally subtract different columns in R

I have a dataset of a hypothetical exam.
id <- c(1,1,3,4,5,6,7,7,8,9,9)
test_date <- c("2012-06-27","2012-07-10","2013-07-04","2012-03-24","2012-07-22", "2013-09-16","2012-06-21","2013-10-18", "2013-04-21", "2012-02-16", "2012-03-15")
result_date <- c("2012-07-29","2012-09-02","2013-08-01","2012-04-25","2012-09-01","2013-10-20","2012-07-01","2013-10-31", "2013-05-17", "2012-03-17", "2012-04-20")
data1 <- as_data_frame(id)
data1$test_date <- test_date
data1$result_date <- result_date
colnames(data1)[1] <- "id"
"id" indicates the ID of the students who have taken a particular exam. "test_date" is the date the students took the test and "result_date" is the date when the students' results are posted. I'm interested in finding out which students retook the exam BEFORE the result of that exam session was released, e.g. students who knew that they have underperformed and retook the exam without bothering to find out their scores. For example, student with "id" 1 took the exam for the second time on "2012-07-10" which was before the result date for his first exam - "2012-07-29".
I tried to:
data1%>%
group_by(id) %>%
arrange(id, test_date) %>%
filter(n() >= 2) %>% #To only get info on students who have taken the exam more than once and then merge it back in with the original data set using a join function
So essentially, I want to create a new column called "re_test" where it would equal 1 if a student retook the exam BEFORE receiving the result of a previous exam and 0 otherwise (those who retook after seeing their marks or those who did not retake).
I have tried to mutate in order to find cases where dates are either positive or negative by subtracting the 2nd test_date from the 1st result_date:
mutate(data1, re_test = result_date - lead(test_date, default = first(test_date)))
However, this leads to mixing up students with different id's. I tried to split but mutate won't work on a list of dataframes so now I'm stuck:
split(data1, data1$id)
Just to add on, this is a part of the desired result:
data2 <- as_data_frame(id <- c(1,1,3,4))
data2$test_date_result <- c("2012-06-27","2012-07-10", "2013-07-04","2012-03-24")
data2$result_date_result <- c("2012-07-29","2012-09-02","2013-08-01","2012-04-25")
data2$re_test <- c(1, 0, 0, 0)
Apologies for the verbosity and hope I was clear enough.
Thanks a lot in advance!
library(reshape2)
library(dplyr)
# first melt so that we can sequence by date
data1m <- data1 %>%
melt(id.vars = "id", measure.vars = c("test_date", "result_date"), value.name = "event_date")
# any two tests in a row is a flag - use dplyr::lag to compare with the previous event
data1mc <- data1m %>%
arrange(id, event_date) %>%
group_by(id) %>%
mutate (multi_test = (variable == "test_date" & lag(variable == "test_date"))) %>%
filter(multi_test)
# id variable event_date multi_test
# 1 1 test_date 2012-07-10 TRUE
# 2 9 test_date 2012-03-15 TRUE
## join back to the original
data1 %>%
left_join (data1mc %>% select(id, event_date, multi_test),
by=c("id" = "id", "test_date" = "event_date"))
I have a piecewise answer that may work for you. I first create a data.frame called student that contains the re-test information, and then join it with the data1 object. If students re-took the test multiple times, it will compare the last test to the first, which is a flaw, but I'm unsure if students have the ability to re-test multiple times?
student <- data1 %>%
group_by(id) %>%
summarise(retest = n() > 1 && test_date[n()] < result_date[1])
Students who took the test only once are flagged FALSE by the n() > 1 check (without it, their only test date would be compared with its own result date, which is always TRUE). The line below then only matters if a date is missing (NA); you can also retain such NAs, as they do contain information.
student$retest[is.na(student$retest)] <- FALSE
Join the two data.frames to a single object called data2.
data2 <- left_join(data1, student, by='id')
I am sure there are more elegant ways to approach this. I did this by taking advantage of the structure of your data (sorted by id) and the lag function that can refer to the previous records while dealing with a current record.
### Ensure Data are sorted by ID ###
data1 <- arrange(data1,id)
### Create Flag for those that repeated ###
data1$repeater <- ifelse(lag(data1$id) == data1$id,1,0)
### I chose to do this on all data, you could filter on repeater flag first ###
data1$timegap <- as.Date(data1$result_date) - as.Date(data1$test_date)
data1$lagdate <- as.Date(data1$test_date) - lag(as.Date(data1$result_date))
### Display results where your repeater flag is 1 and there is negative time lag ###
data1[data1$repeater==1 & !is.na(data1$repeater) & as.numeric(data1$lagdate) < 0,]
# A tibble: 2 × 6
id test_date result_date repeater timegap lagdate
<dbl> <chr> <chr> <dbl> <time> <time>
1 1 2012-07-10 2012-09-02 1 54 days -19 days
2 9 2012-03-15 2012-04-20 1 36 days -2 days
I went with a simple shift comparison. 1 line of code.
data1 <- data.frame(id = c(1,1,3,4,5,6,7,7,8,9,9), test_date = c("2012-06-27","2012-07-10","2013-07-04","2012-03-24","2012-07-22", "2013-09-16","2012-06-21","2013-10-18", "2013-04-21", "2012-02-16", "2012-03-15"), result_date = c("2012-07-29","2012-09-02","2013-08-01","2012-04-25","2012-09-01","2013-10-20","2012-07-01","2013-10-31", "2013-05-17", "2012-03-17", "2012-04-20"))
data1$re_test <- unlist(lapply(split(data1,data1$id), function(x)
ifelse(as.Date(x$test_date) > c(NA, as.Date(x$result_date[-nrow(x)])), 0, 1)))
data1
id test_date result_date re_test
1 1 2012-06-27 2012-07-29 NA
2 1 2012-07-10 2012-09-02 1
3 3 2013-07-04 2013-08-01 NA
4 4 2012-03-24 2012-04-25 NA
5 5 2012-07-22 2012-09-01 NA
6 6 2013-09-16 2013-10-20 NA
7 7 2012-06-21 2012-07-01 NA
8 7 2013-10-18 2013-10-31 0
9 8 2013-04-21 2013-05-17 NA
10 9 2012-02-16 2012-03-17 NA
11 9 2012-03-15 2012-04-20 1
I think there is benefit in leaving NAs but if you really want all others as zero, simply:
data1$re_test <- ifelse(is.na(data1$re_test), 0, data1$re_test)
data1
id test_date result_date re_test
1 1 2012-06-27 2012-07-29 0
2 1 2012-07-10 2012-09-02 1
3 3 2013-07-04 2013-08-01 0
4 4 2012-03-24 2012-04-25 0
5 5 2012-07-22 2012-09-01 0
6 6 2013-09-16 2013-10-20 0
7 7 2012-06-21 2012-07-01 0
8 7 2013-10-18 2013-10-31 0
9 8 2013-04-21 2013-05-17 0
10 9 2012-02-16 2012-03-17 0
11 9 2012-03-15 2012-04-20 1
Let me know if you have any questions, cheers.
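The answers above flag the retake itself, whereas the desired output in the question puts the 1 on the earlier exam whose result had not yet been released. A base R sketch of that variant follows; it avoids mixing students by computing the next test date within each id. The column name next_test is made up, and the sketch assumes the dates are stored as Date values.

```r
data1 <- data.frame(
  id = c(1, 1, 3, 4, 5, 6, 7, 7, 8, 9, 9),
  test_date = as.Date(c("2012-06-27", "2012-07-10", "2013-07-04", "2012-03-24",
                        "2012-07-22", "2013-09-16", "2012-06-21", "2013-10-18",
                        "2013-04-21", "2012-02-16", "2012-03-15")),
  result_date = as.Date(c("2012-07-29", "2012-09-02", "2013-08-01", "2012-04-25",
                          "2012-09-01", "2013-10-20", "2012-07-01", "2013-10-31",
                          "2013-05-17", "2012-03-17", "2012-04-20"))
)

# Sort so that each student's exams are in chronological order
data1 <- data1[order(data1$id, data1$test_date), ]

# Next test date within the same id (NA for a student's last exam)
next_test <- ave(as.numeric(data1$test_date), data1$id,
                 FUN = function(x) c(x[-1], NA))

# 1 if the student sat the next exam before this exam's result was released
data1$re_test <- as.integer(!is.na(next_test) &
                              next_test < as.numeric(data1$result_date))
data1$re_test
# 1 0 0 0 0 0 0 0 0 1 0
```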

Merge/Join Data Frame / Table based on criteria - > or <

I have a data frame with weekly data by Section. Each Section has approx 104 weeks' worth of data and there are 83 sections in total.
I have a second data frame with the Start and End week by Section that I want to filter the main data frame on.
In both tables the Week is a combination of Year and Week e.g. 201501 and is always from weeks 1 to 52.
So in the example below I want to filter Section A by weeks 201401 to 201404, Section B by weeks 201551 to 201603.
I initially thought I could add an additional column to my Weeks_Filter data frame that is a sequential number from the start to the end of the weeks for each section (duplicating each row for each week), then merge the 2 tables and keep all the data from the Weeks_Filter table (all.y = TRUE). This worked on a small sample I did, but I don't know how to add the sequential weeks since they can span different years.
Week <- c("201401","201402","201403","201404","201405", "201451", "201552", "201601", "201602", "201603")
Section <- c(rep("A",5),rep("B",5))
df <- data.frame(cbind(Week, Section))
Section <- c("A", "B")
Start <- c("201401","201551")
End <- c("201404","201603")
Weeks_Filter <- data.frame(cbind(Section, Start, End))
data.table supports non-equi joins since version 1.9.8 (in older versions you can use foverlaps):
setDT(df) # convert to data.table in place
setDT(Weeks_Filter)
# fix the column types - you have factors currently, converting to integer
df[, Week := as.integer(as.character(Week))]
Weeks_Filter[, `:=`(Start = as.integer(as.character(Start)),
End = as.integer(as.character(End)))]
# the actual magic
df[df[Weeks_Filter, on = .(Section, Week >= Start, Week <= End), which = TRUE]]
# Week Section
#1: 201401 A
#2: 201402 A
#3: 201403 A
#4: 201404 A
#5: 201552 B
#6: 201601 B
#7: 201602 B
#8: 201603 B
Using dplyr you can
combine your data frames
group by Section
filter based on the Start and End columns
One problem is that your 'weeks' are characters and become factors the way you've encoded them. I took the shortcut and just made them numeric, but I'd recommend using lubridate to make these proper Date class vectors.
library(dplyr)
tempdf <- full_join(df, Weeks_Filter)
tempdf$Week <- as.numeric(as.character(tempdf$Week))
tempdf$Start <- as.numeric(as.character(tempdf$Start))
tempdf$End <- as.numeric(as.character(tempdf$End))
tempdf_filt <- tempdf %>%
group_by(Section) %>%
filter(Week >= Start,
Week <= End)
It looks like there's a problem in your data that "201451" should be "201551", but otherwise returns what you want:
> tempdf_filt
Source: local data frame [8 x 4]
Groups: Section [2]
Week Section Start End
(dbl) (fctr) (dbl) (dbl)
1 201401 A 201401 201404
2 201402 A 201401 201404
3 201403 A 201401 201404
4 201404 A 201401 201404
5 201552 B 201551 201603
6 201601 B 201551 201603
7 201602 B 201551 201603
8 201603 B 201551 201603
Perhaps creating a vector of all desired weeks would work for the filter. Here is a rough example using base R:
# get zero-padded week numbers
allWeeks <- sprintf("%02d", 1:52)
# get all year-weeks (every week of every year from 2014 to 2016)
allWeeks <- paste0(rep(2014:2016, each = 52), allWeeks)
# filter vector to select desired weeks
keepWeeks <- allWeeks[grep("^201(40[1-4]|55[12]|60[123])$", allWeeks)]
dfKeeper <- df[df$Week %in% keepWeeks,]
I tried to construct a regular expression that would capture the periods that you want, but you may have to adjust it a bit.
require(data.table)
df <- merge(df, Weeks_Filter)
df[, -1] <- apply(df[, -1], 2, function(x) as.numeric(as.character(x)))
df <- data.table(df)
df[Week >= Start & Week <= End, .SD, by = Section]
The Output is,
Section Start End Week
1: A 201401 201404 201401
2: A 201401 201404 201402
3: A 201401 201404 201403
4: A 201401 201404 201404
5: B 201551 201603 201552
6: B 201551 201603 201601
7: B 201551 201603 201602
8: B 201551 201603 201603
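The question also asks how to enumerate sequential weeks when a range spans two years. A minimal sketch, relying on the statement in the question that every year runs from week 1 to week 52 (the function name week_seq is made up), maps each YYYYWW value to a running week index and back:

```r
# Hypothetical helper: all YYYYWW values from start to end, inclusive,
# assuming every year has exactly 52 weeks numbered 01 to 52
week_seq <- function(start, end) {
  to_index <- function(w) (w %/% 100L) * 52L + (w %% 100L - 1L)
  idx <- seq(to_index(start), to_index(end))
  (idx %/% 52L) * 100L + idx %% 52L + 1L
}

week_seq(201551L, 201603L)
# 201551 201552 201601 201602 201603
```

The same assumption is what makes the plain numeric comparisons in the answers above valid: with zero-padded weeks, the numeric order of YYYYWW matches chronological order.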

Creating a vector containing total quantities sold per delivery term

Have a look at the simplified table below. For each product, I want a vector containing the quantities sold within each delivery term. A delivery term is defined as 4 days. So if we look at product A, we see that it starts at 03/12/15 and within the first delivery term (until 07/12/15) it has sold a quantity of 4. The second delivery term starts at 08/12/15 and ends at 12/12/15, so for this period there is a quantity of 1 sold. The following delivery term starts at 13/12/15 and ends at 17/12/15. During this period no quantities are sold, and thus the vector must have a value of 0 for this period. In the last period, finally, 2 products are sold. So basically the problem here is that information regarding the periods where no products are sold is missing.
Any ideas on how the vector I want can be created using R? I've been thinking of for or while loops, but these do not seem to give the requested results. Note that the code must be applicable to a real dataset containing over 1000 product categories, so it has to be automated in some way.
I would be very grateful if somebody could point me in the right direction.
Product Quantity Date
A 1 03/12/15
A 2 04/12/15
A 1 05/12/15
A 1 08/12/15
A 1 17/12/16
A 1 18/12/16
B 1 19/12/15
B 2 10/05/15
B 2 11/05/15
C 1 01/06/15
C 1 02/06/15
C 1 12/06/15
Assume that dt is the dataset you provided. You'll get a better understanding of the process if you run it step by step (and maybe with an even simpler dataset).
library(lubridate)
library(dplyr)
# create date time columns
dt$Date = dmy(dt$Date)
dt %>%
group_by(Product) %>%
do(data.frame(days = seq(min(.$Date), max(.$Date), by="1 day"))) %>% # create all combinations between product and days
mutate(dist = as.numeric(difftime(days,min(days), units="days"))) %>% # create distance of each day with min date
ungroup() %>%
left_join(dt, by=c("Product"="Product","days"="Date")) %>% # join info to get quantities for each day
mutate(Quantity = ifelse(is.na(Quantity), 0, Quantity), # replace NAs with 0s
id = floor(dist/5 + 1)) %>% # create the period id (5 calendar days per term, e.g. 03/12 to 07/12 inclusive)
group_by(Product, id) %>%
summarise(Sum = sum(Quantity),
min_date = min(days),
max_date = max(days)) %>%
ungroup
# Product id Sum min_date max_date
# 1 A 1 4 2015-12-03 2015-12-07
# 2 A 2 1 2015-12-08 2015-12-12
# 3 A 3 0 2015-12-13 2015-12-17
# 4 A 4 0 2015-12-18 2015-12-22
# 5 A 5 0 2015-12-23 2015-12-27
# 6 A 6 0 2015-12-28 2016-01-01
# 7 A 7 0 2016-01-02 2016-01-06
# 8 A 8 0 2016-01-07 2016-01-11
# 9 A 9 0 2016-01-12 2016-01-16
# 10 A 10 0 2016-01-17 2016-01-21
# .. ... .. ... ... ...
First row of the output tells you that for product A in the first 4 days period (id = 1) you had 4 quantities in total and the period is from 3/12 to 7/12.
I would suggest {dplyr}'s summarise(),mutate() and group_by() functions. group_by() groups your data by desired variables (in your case - product and delivery term),mutate() allows operations on grouped columns, and summarise() applies a summarising function over these groups (in your case sum(Quantity)).
So this is how it will look:
convert date into proper format:
library(dplyr)
df=as_tibble(df) # tbl_df() is deprecated in current dplyr
df$Date=as.Date(df$Date,format="%d/%m/%y")
calculating delivery terms
df=group_by(df,Product) %>% arrange(Date)
df=mutate(df,term=1+unclass((Date-min(Date)))%/%4)
group by product and terms and calculate sum of quantity:
df=group_by(df,Product,term)
summarise(df,sum=sum(Quantity))
Here's a base R way:
df$groups <- ave(as.numeric(df$Date), df$Product, FUN=function(x) {
intrvl <- findInterval(x, seq(min(x), max(x),4))
as.numeric(factor(intrvl))
})
df
# Product Quantity Date groups
# 1 A 1 2015-12-03 1
# 2 A 2 2015-12-04 1
# 3 A 1 2015-12-05 1
# 4 A 1 2015-12-08 2
# 5 A 1 2016-12-17 3
# 6 A 1 2016-12-18 3
# 7 B 1 2015-12-19 2
# 8 B 2 2015-05-10 1
# 9 B 2 2015-05-11 1
# 10 C 1 2015-06-01 1
# 11 C 1 2015-06-02 1
# 12 C 1 2015-06-12 2
The dates should be converted to one of the date classes. I chose as.Date. When it converts to numeric, the output will be the number of days from a specified date. From there, we are able to group by 4 day increments.
Data
df$Date <- as.Date(df$Date, format="%d/%m/%y")
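If the goal is literally a vector per product with zeros for empty terms, a base R sketch for a single product could look like the following. The function name term_totals and the sample dates are made up for illustration, and the term length of 5 calendar days follows the 03/12 to 07/12 example in the question rather than the stated "4 days".

```r
# Hypothetical helper: total quantity per delivery term, zeros included
term_totals <- function(dates, qty, term_days = 5) {
  # Term number of each sale, counted from the product's first sale date
  term <- as.integer(dates - min(dates)) %/% term_days + 1L
  totals <- numeric(max(term))             # one slot per term, all zero
  sums <- tapply(qty, term, sum)           # totals for terms with sales
  totals[as.integer(names(sums))] <- sums  # fill in the non-empty terms
  totals
}

# Made-up sales for one product
dates <- as.Date(c("2015-12-03", "2015-12-04", "2015-12-05",
                   "2015-12-08", "2015-12-18", "2015-12-19"))
qty <- c(1, 2, 1, 1, 1, 1)

term_totals(dates, qty)
# 4 1 0 2
```

For many products, the same function could be applied per group, for example with split() and Map().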
