Summing based on conditions from two dataframes and dealing with dates in R

I have two dataframes, one with climate data for every location and date across 4 years. The other data frame has a date for each day an animal was trapped at a site. I am trying to calculate the mean of each climate variable based on a specific amount of time before the day the animal was trapped (time length depends on variable in question).
climate <- data.frame(site = c(1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1),
                      precip = c(0.1, 0.2, 0.1, 0.1, 0.5, 0.2, 0.3, 0.1, 0.2, 0.1, 0.1, 0.5),
                      humid = c(1, 1, 3, 1, 2, 3, 3, 1, 1, 3, 1, 2),
                      date = c("6/13/2020", "6/12/2020", "6/11/2020", "6/14/2020",
                               "6/13/2020", "6/12/2020", "6/11/2020", "6/14/2020",
                               "2/13/2019", "2/14/2019", "2/15/2019", "2/16/2019"))
trap <- data.frame(site = c(1, 2, 3, 3),
                   date = c("7/1/2020", "7/1/2020", "7/2/2020", "7/4/2020"))
> climate
   site precip humid      date
1     1    0.1     1 6/13/2020
2     1    0.2     1 6/12/2020
3     1    0.1     3 6/11/2020
4     1    0.1     1 6/14/2020
5     2    0.5     2 6/13/2020
6     2    0.2     3 6/12/2020
7     2    0.3     3 6/11/2020
8     2    0.1     1 6/14/2020
9     1    0.2     1 2/13/2019
10    1    0.1     3 2/14/2019
11    1    0.1     1 2/15/2019
12    1    0.5     2 2/16/2019
> trap
  site     date
1    1 7/1/2020
2    2 7/1/2020
3    3 7/2/2020
4    3 7/4/2020
I want to calculate the mean humid 18-20 days before the date in the trap dataframe. So essentially, for animals trapped on 7/1/2020, what is the mean humid between 6/11/2020 and 6/13/2020 according to the climate dataframe? For site 1 that would be 1.667 and for site 2 it would be 2.667.
I also want to calculate the sum of precipitation 497-500 days before the date in the trap dataframe. So I would need to calculate the sum (total) precip between 2/13/2019 and 2/16/2019 for an animal trapped on 7/1/2020 at each site. For site 1 that precip sum would be 0.9.
I know how to create new columns in the trap dataframe for mean humid and total precip, but I'm not sure where to start coding so that each value is calculated as described above, using the climate data that corresponds to the correct dates, for the full dataset that contains many different trap dates.
Thank you very much; hopefully I am being clear in my description.

I have a solution using functions from the tidyverse. It is always useful to convert date variables to the Date class; with this class you can do arithmetic on dates. Note that I renamed the date column in the trap data to trap_date. See the comments for more details:
library(tidyverse)
climate <- data.frame(site = c(1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1),
                      precip = c(0.1, 0.2, 0.1, 0.1, 0.5, 0.2, 0.3, 0.1, 0.2, 0.1, 0.1, 0.5),
                      humid = c(1, 1, 3, 1, 2, 3, 3, 1, 1, 3, 1, 2),
                      date = c("6/13/2020", "6/12/2020", "6/11/2020", "6/14/2020",
                               "6/13/2020", "6/12/2020", "6/11/2020", "6/14/2020",
                               "2/13/2019", "2/14/2019", "2/15/2019", "2/16/2019"))
trap <- data.frame(site = c(1, 2, 3, 3),
                   trap_date = c("7/1/2020", "7/1/2020", "7/2/2020", "7/4/2020"))
# parse dates to class 'Date'; enables date arithmetic
climate <- climate %>%
  mutate(date = parse_date(date, format = "%m/%d/%Y"))
trap <- trap %>%
  mutate(trap_date = parse_date(trap_date, format = "%m/%d/%Y"))
# merge data
data <- merge(climate, trap, by = "site")
> head(data)
  site precip humid       date  trap_date
1    1    0.1     1 2020-06-13 2020-07-01
2    1    0.2     1 2020-06-12 2020-07-01
3    1    0.1     3 2020-06-11 2020-07-01
4    1    0.1     1 2020-06-14 2020-07-01
5    1    0.2     1 2019-02-13 2020-07-01
6    1    0.1     3 2019-02-14 2020-07-01
For means:
# humid means
data %>%
  group_by(site) %>%
  filter(date >= trap_date - 20 & date <= trap_date - 18) %>%
  summarise(mean = mean(humid))
# A tibble: 2 x 2
   site  mean
  <dbl> <dbl>
1     1  1.67
2     2  2.67
However, the range of 497 to 500 days before the trap date contains no observations: 500 days before 7/1/2020 is 2/17/2019, which already falls after your 2/13/2019-2/16/2019 window. When I used your specified dates directly, I got the expected result of 0.9:
# precip sums
data %>%
  group_by(site) %>%
  filter(date >= trap_date - 500 & date <= trap_date - 497)
# A tibble: 0 x 5
# Groups: site [0]
# ... with 5 variables: site <dbl>, precip <dbl>, humid <dbl>,
#   date <date>, trap_date <date>
# using your provided dates
data %>%
  group_by(site) %>%
  filter(date >= as.Date("2019-02-13") & date <= as.Date("2019-02-16")) %>%
  summarise(sum = sum(precip))
# A tibble: 1 x 2
   site   sum
  <dbl> <dbl>
1     1   0.9
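To attach both summaries back onto the trap data in one pass, here is a sketch along the same lines (the window arithmetic is from above; the .groups argument assumes dplyr >= 1.0, and the left joins keep trap rows, such as site 3, that have no matching climate data; as noted, the 497-500 day window matches no rows here, so sum_precip comes back NA):
humid_means <- data %>%
  filter(date >= trap_date - 20 & date <= trap_date - 18) %>%
  group_by(site, trap_date) %>%
  summarise(mean_humid = mean(humid), .groups = "drop")
precip_sums <- data %>%
  filter(date >= trap_date - 500 & date <= trap_date - 497) %>%
  group_by(site, trap_date) %>%
  summarise(sum_precip = sum(precip), .groups = "drop")
trap %>%
  left_join(humid_means, by = c("site", "trap_date")) %>%
  left_join(precip_sums, by = c("site", "trap_date"))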
Hope this helps.

Related

Selecting rows that fulfill the criteria of having other rows within a given difference of a variable in the same group

I've got a dataset of many individuals ("ID") with body weight measurements ("BW") at random time points ("Time") spanning over 15 years.
Example:
ID <- c("1", "1", "1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3")
Time <- c("2015/1/1", "2015/3/1", "2016/1/1", "2016/3/1", "2017/1/1", "2018/5/1",
          "2012/1/1", "2017/5/1", "2019/4/1", "2020/4/1", "2019/10/1", "2020/1/1", "2020/4/1")
BW <- rnorm(13, mean = 75)
df <- data.frame(ID, Time, BW)
ID Time BW
1 1 2015/1/1 75.01736
2 1 2015/3/1 75.44717
3 1 2016/1/1 73.09934
4 1 2016/3/1 74.79920
5 1 2017/1/1 74.70097
6 1 2018/5/1 74.23496
7 2 2012/1/1 73.57179
8 2 2017/5/1 74.50970
9 2 2019/4/1 74.43412
10 2 2020/4/1 75.02952
11 3 2019/10/1 76.41390
12 3 2020/1/1 75.79827
13 3 2020/4/1 74.46035
What I'm trying to filter are IDs with measurements that have one other measurement within 12 ± 3 months before them and one within 12 ± 3 months after them, i.e. one body weight at 0 yr ± 3 months, one at 1 yr, and one at 2 yr ± 3 months. In this case, only rows 3 to 5 fulfill the criteria.
And among all individuals that fulfill this criteria, I would like to choose the measurement that has the most data points within this ± 15 month range. The example desired output may look like:
ID Time BW Fulfill Counts
1 1 2015/1/1 75.01736 0 4
2 1 2015/3/1 75.44717 0 4
3 1 2016/1/1 73.09934 1 5
4 1 2016/3/1 74.79920 1 5
5 1 2017/1/1 74.70097 1 3
6 1 2018/5/1 74.23496 0 2
7 2 2012/1/1 73.57179 0 1
8 2 2017/5/1 74.50970 0 1
9 2 2019/4/1 74.43412 0 2
10 2 2020/4/1 75.02952 0 2
11 3 2019/10/1 76.41390 0 3
12 3 2020/1/1 75.79827 0 3
13 3 2020/4/1 74.46035 0 3
I've tried my best searching for similar answers on the internet, but I couldn't come up with anything remotely close to what I want to do. I could only make it to the grouping part with
group_by(ID) %>%
mutate(Fulfill == if time-...)
and then got stuck on the "calculating the difference with every other row" part. I'm imagining something like a loop over each row within a group (ID) to calculate the difference in time, and then a logical statement for determining whether it's true or not. I've used R for a while but only for descriptive statistics previously, so I'm sorry if it's actually quite simple. Thanks.
Here is a tidyverse approach (not completely optimised; you could probably simplify it to a single function call with map_dfr or similar).
I've chosen the purrr::map_ functions. They let me apply a function to every entry of the Time column separately (the first argument passed to map_), while also passing the complete Time column (the second argument), so the filter operations can check whether there are entries within the ± 15 month windows.
ID <- c("1", "1", "1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3")
Time <- c("2015/1/1", "2015/3/1", "2016/1/1", "2016/3/1", "2017/1/1", "2018/5/1",
          "2012/1/1", "2017/5/1", "2019/4/1", "2020/4/1", "2019/10/1", "2020/1/1", "2020/4/1")
BW <- rnorm(13, mean = 75)
df <- data.frame(ID, Time, BW)
library(dplyr)
library(purrr)
library(lubridate)
check_entries <- function(curr_entry, entries) {
  # establish bounds in which there must be entries
  lower_bound_1 <- curr_entry %m-% months(15)
  lower_bound_2 <- curr_entry %m-% months(9)
  upper_bound_1 <- curr_entry %m+% months(9)
  upper_bound_2 <- curr_entry %m+% months(15)
  # filter the entries that match the time period constraints
  entries <- data.frame(entries = entries)
  filtered_lower <- entries %>%
    filter(entries >= lower_bound_1 & entries <= lower_bound_2)
  filtered_upper <- entries %>%
    filter(entries >= upper_bound_1 & entries <= upper_bound_2)
  # check if there is both a matching earlier and a matching later entry
  nrow(filtered_lower) > 0 && nrow(filtered_upper) > 0
}
calculate_number_entries <- function(curr_entry, entries) {
  # establish bounds in which there must be entries
  lower_bound <- curr_entry %m-% months(15)
  upper_bound <- curr_entry %m+% months(15)
  # filter the matching entries and count the observations
  entries <- data.frame(entries = entries)
  entries %>%
    filter(entries >= lower_bound & entries <= upper_bound) %>%
    nrow()
}
df %>%
  group_by(ID) %>%
  mutate(Time = as.Date(Time, format = "%Y/%m/%d"),
         Fulfill = map_lgl(Time, check_entries, Time),
         Fulfill_ID = sum(Fulfill) > 0,
         Counts = map_int(Time, calculate_number_entries, Time))
#> # A tibble: 13 x 6
#> # Groups: ID [3]
#> ID Time BW Fulfill Fulfill_ID Counts
#> <chr> <date> <dbl> <lgl> <lgl> <int>
#> 1 1 2015-01-01 75.4 FALSE TRUE 4
#> 2 1 2015-03-01 74.0 FALSE TRUE 4
#> 3 1 2016-01-01 74.2 TRUE TRUE 5
#> 4 1 2016-03-01 74.9 TRUE TRUE 5
#> 5 1 2017-01-01 75.6 FALSE TRUE 3
#> 6 1 2018-05-01 73.8 FALSE TRUE 1
#> 7 2 2012-01-01 75.6 FALSE FALSE 1
#> 8 2 2017-05-01 75.0 FALSE FALSE 1
#> 9 2 2019-04-01 74.3 FALSE FALSE 2
#> 10 2 2020-04-01 74.9 FALSE FALSE 2
#> 11 3 2019-10-01 75.5 FALSE FALSE 3
#> 12 3 2020-01-01 75.3 FALSE FALSE 3
#> 13 3 2020-04-01 76.0 FALSE FALSE 3
Created on 2020-12-06 by the reprex package (v0.3.0)
Note that I find a different result for the 5th entry; you may want to check whether the month addition/subtraction behaves as you need it to. Check out lubridate for more info.
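The question also asks, among the individuals that fulfill the criteria, for the measurement with the most data points in the ± 15 month range. A hedged follow-up sketch, assuming the pipeline above is saved as result (slice_max() needs dplyr >= 1.0):
result %>%
  filter(Fulfill_ID) %>%   # keep only IDs with at least one fulfilling row
  slice_max(Counts, n = 1) # per ID, the row(s) with the most observations
Since result is still grouped by ID, slice_max() operates within each individual.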

R Expand time series data based on start and end point

I think I have a pretty simple request. I have the following dataframe, where "place" is a unique identifier, while start_date and end_date may overlap. The values are unique for each "place" ID.
place start_date end_date value
1 2007-09-01 2010-10-12 0.5
2 2013-09-27 2015-10-11 0.7
...
What I need is to create a year-based variable, expanding the time series so that each year covered by the interval gets its own row for that particular "place" and "value" (each new year starting from the first of January, i.e. 2011-01-01). I mean something like this:
place year value
1 2007 0.5
1 2008 0.5
1 2009 0.5
1 2010 0.5
2 2013 0.7
2 2014 0.7
2 2015 0.7
...
There are some cases of overlap (e.g. "place" = 1 & "year" = 2007) across two separate observations, where one observation ends in a given year and another observation for the same place starts in that same year. In that case I would prefer the "value" from the observation that ends in that year. So if one observation for place = 1 ends in March 2007 and another place = 1 observation starts in April 2007, the year = 2007 value for place = 1 would be marked with the previous, "ending" value, if that makes sense.
I've only gotten this far:
library(data.table)
data <- data.table(dat)
data[, `:=`(start_date = as.Date(start_date), end_date = as.Date(end_date))]
data[, num_mons := length(seq(from = start_date, to = end_date, by = 'year')), by = 1:nrow(data)]
I guess writing a loop makes the most sense?
Thank you for your help and advice.
A tidyverse solution could look like this:
library(dplyr)
library(stringr)
library(purrr)
library(tidyr)
data <- tibble(place = c(1, 2),
               start_date = c('2007-09-01', '2013-09-27'),
               end_date = c('2010-10-12', '2015-10-11'),
               value = c(0.5, 0.7))
data %>%
  mutate(year = map2(start_date,
                     end_date,
                     ~ as.character(str_extract(.x, '\\d{4}'):
                                      str_extract(.y, '\\d{4}')))) %>%
  separate_rows(year) %>%
  filter(!year %in% c('c', '')) %>%
  select(place, year, value)
# place year value
# <dbl> <chr> <dbl>
# 1 1 2007 0.5
# 2 1 2008 0.5
# 3 1 2009 0.5
# 4 1 2010 0.5
# 5 2 2013 0.7
# 6 2 2014 0.7
# 7 2 2015 0.7
I'm having problems understanding the third paragraph of your question ("There are ..."). It seems to me to be a separate question. If that is the case, please consider moving the question to a separate post here on SO. If it is not a separate question, please reformulate the paragraph.
You could do the following:
library(lubridate)
library(tidyverse)
df %>%
  group_by(place) %>%
  mutate(year = list(seq(year(ymd(start_date)), year(ymd(end_date))))) %>%
  unnest(year) %>%
  select(place, year, value)
# A tibble: 7 x 3
# Groups: place [2]
place year value
<int> <int> <dbl>
1 1 2007 0.5
2 1 2008 0.5
3 1 2009 0.5
4 1 2010 0.5
5 2 2013 0.7
6 2 2014 0.7
7 2 2015 0.7
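For completeness, here is a data.table sketch in the direction the asker started (a hypothetical setup mirroring the example data; data.table's year() helper extracts the year, and the length-1 value is recycled within each place group):
library(data.table)
dt <- data.table(place = c(1, 2),
                 start_date = as.Date(c("2007-09-01", "2013-09-27")),
                 end_date = as.Date(c("2010-10-12", "2015-10-11")),
                 value = c(0.5, 0.7))
# expand each row into one row per calendar year it spans
dt[, .(year = year(start_date):year(end_date), value), by = place]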

Time difference calculated from wide data with missing rows

There is a longitudinal data set in wide format, from which I want to compute the time (in years and days) between the first observation date and the last date an individual was observed. Dates are in the format yyyy-mm-dd. The data set has four observation periods with missing dates; an example is as follows:
df1 <- data.frame("id" = c(1:4),
                  "adate" = c("2011-06-18", "2011-06-18", "2011-04-09", "2011-05-20"),
                  "bdate" = c("2012-06-15", "2012-06-15", NA, "2012-05-23"),
                  "cdate" = c("2013-06-18", "2013-06-18", "2013-04-09", NA),
                  "ddate" = c("2014-06-15", NA, "2014-04-11", NA))
Here "adate" is the first date and the last date is the date an individual was last seen. To compute the time difference (lastdate-adate), I have tried using "lubridate" package, for example
lubridate::time_length(difftime(as.Date("2012-05-23"), as.Date("2011-05-20")),"years")
However, I'm challenged by the fact that the last date does not come from a single column. I'm looking for a way to automate the calculation in R. The expected output would look like:
id years days
1 1 2.99 1093
2 2 2.00 731
3 3 3.01 1098
4 4 1.01 369
Years is approximated to 2 decimal places.
Another tidyverse solution: convert the data to long format, remove the NA dates, and get the time difference between the last and first date for each id.
library(dplyr)
library(tidyr)
library(lubridate)
df1 %>%
  pivot_longer(-id) %>%
  na.omit %>%
  group_by(id) %>%
  mutate(value = as.Date(value)) %>%
  summarise(years = time_length(difftime(last(value), first(value)), "years"),
            days = as.numeric(difftime(last(value), first(value))))
#> # A tibble: 4 x 3
#> id years days
#> <int> <dbl> <dbl>
#> 1 1 2.99 1093
#> 2 2 2.00 731
#> 3 3 3.01 1098
#> 4 4 1.01 369
We could use pmap
library(dplyr)
library(purrr)
library(tidyr)
df1 %>%
  mutate(out = pmap(.[-1], ~ {
    dates <- as.Date(na.omit(c(...)))
    tibble(years = lubridate::time_length(difftime(last(dates), first(dates)), "years"),
           days = lubridate::time_length(difftime(last(dates), first(dates)), "days"))
  })) %>%
  unnest_wider(out)
# A tibble: 4 x 7
# id adate bdate cdate ddate years days
# <int> <chr> <chr> <chr> <chr> <dbl> <dbl>
#1 1 2011-06-18 2012-06-15 2013-06-18 2014-06-15 2.99 1093
#2 2 2011-06-18 2012-06-15 2013-06-18 <NA> 2.00 731
#3 3 2011-04-09 <NA> 2013-04-09 2014-04-11 3.01 1098
#4 4 2011-05-20 2012-05-23 <NA> <NA> 1.01 369
Most of the functions introduced here may seem quite complex; they are worth learning if possible. Nevertheless, here is a base R approach (the date columns must be converted to the Date class first):
df1[-1] <- lapply(df1[-1], as.Date)                    # convert to Date class
grp <- droplevels(interaction(df1[, 1], row(df1[-1]))) # create a grouping
days <- tapply(unlist(df1[-1]), grp, function(x) max(x, na.rm = TRUE) - x[1]) # get the difference
cbind(df1[1], days, years = round(days / 365, 2))      # create your table
id days years
1.1 1 1093 2.99
2.2 2 731 2.00
3.3 3 1098 3.01
4.4 4 369 1.01
If you are comfortable with other higher-level functions, then you could do:
dat <- aggregate(adate ~ id, reshape(df1, list(2:ncol(df1)), dir = "long"), function(x) max(x) - x[1])
transform(dat, year = round(adate / 365, 2))
id adate year
1 1 1093 2.99
2 2 731 2.00
3 3 1098 3.01
4 4 369 1.01
Using base R apply:
df1[-1] <- lapply(df1[-1], as.Date)
df1[c('years', 'days')] <- t(apply(df1[-1], 1, function(x) {
  x <- na.omit(x)
  x1 <- difftime(x[length(x)], x[1], units = 'days')
  c(x1 / 365, x1)
}))
df1[c('id', 'years', 'days')]
df1[c('id', 'years', 'days')]
# id years days
#1 1 2.994521 1093
#2 2 2.002740 731
#3 3 3.008219 1098
#4 4 1.010959 369
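Yet another compact option: since the first date is always adate, only the row-wise latest date is needed, and base R's pmax() can supply it. A sketch assuming the date columns have already been converted with as.Date (as in the answers above; 365.25 approximates a year, which reproduces the rounded output):
last <- do.call(pmax, c(df1[2:5], na.rm = TRUE)) # row-wise latest of adate..ddate
d <- as.numeric(last - df1$adate)                # difference in days
data.frame(id = df1$id, years = round(d / 365.25, 2), days = d)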

Assign day of the year to a month

Sample data
df <- data.frame(ID1 = rep(1:1000, each = 5 * 365),
                 year = rep(rep(2000:2004, each = 365), times = 1000),
                 day = rep(1:365, times = 1000 * 5),
                 x = runif(365 * 1000 * 5))
This data contains a column day, which is the day of the year. I need to produce two columns:
Month column: the month to which the day belongs.
Biweek column: the biweek to which the day belongs. There are 24 biweeks in a year; all days <= 15 in a month form the first biweek of that month and days > 15 form the second.
For example:
1-15 Jan is biweek 1,
16-31 Jan is biweek 2,
1-15 Feb is biweek 3, and
16-28 Feb is biweek 4, and so on.
For the sake of simplicity, I am assuming all the years are non-leap years.
Here's the code I have (with help from RS as well) that creates the two columns.
# create a vector of days for each month
months <- list(1:31, 32:59, 60:90, 91:120, 121:151, 152:181,
               182:212, 213:243, 244:273, 274:304, 305:334, 335:365)
library(dplyr)
ptm <- proc.time()
df <- df %>%
  mutate(month = sapply(day, function(x) which(sapply(months, function(y) x %in% y))), # assign each day to a month
         date = as.Date(paste0(year, '-', format(strptime(paste0('1981-', day), '%Y-%j'), '%m-%d'))), # a vector of dates for a non-leap year
         twowk = month * 2 - (as.numeric(format(date, "%d")) <= 15)) %>% # which biweek each day falls into
  dplyr::select(-date)
proc.time() - ptm
user system elapsed
121.71 0.31 122.43
My issue is the time it takes to run this script; I am looking for a solution that is relatively faster.
EDIT: To be clear, I have assumed all years must have 365 days. In one of the answers below, for the year 2000 (a leap year), Feb has 29 days (its last day of year is 60, but I want the last day to be 59), and therefore Dec has only 30 days (starting at 336 though it should start at 335). I hope this is clear. My solution addresses this issue but takes a lot of time to run.
Here is a solution using lubridate extractors and replacement functions, as mentioned by Frank in a comment. The key ones are yday<-, mday() and month(), which respectively set the day of year of a date, get the day of month of a date, and get the month of a date. An 8-second running time seems pretty acceptable to me, though I'm sure some optimising could shave that down, possibly at the cost of some generality.
Note also the use of case_when to ensure the correct numbering of days after Feb 29 in a leap year.
EDIT: Here is a significantly faster solution. You can just build the mapping of DOYs to months and biweeks for a single year, and then left_join it to the main table. 0.36 s running time, since you no longer have to repetitively create the dates. We also bypass case_when, since the join takes care of the missing days. See that day 59 of year 2000 is February and day 60 is March, as requested.
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#>
#> date
tbl <- tibble(
  ID1 = rep(1:1000, each = 5 * 365),
  year = rep(rep(2000:2004, each = 365), times = 1000),
  day = rep(1:365, times = 1000 * 5),
  x = runif(365 * 1000 * 5)
)
tictoc::tic("")
doys <- tibble(
  day = 1:365,
  date = seq.Date(ymd("2001-1-1"), ymd("2001-12-31"), by = 1),
  month = month(date),
  biweek = case_when(
    mday(date) <= 15 ~ (month * 2) - 1,
    mday(date) > 15 ~ month * 2
  )
)
tbl_out2 <- left_join(tbl, select(doys, -date), by = "day")
tictoc::toc()
#> : 0.36 sec elapsed
tbl_out2
#> # A tibble: 1,825,000 x 6
#> ID1 year day x month biweek
#> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 1 2000 1 0.331 1. 1.
#> 2 1 2000 2 0.284 1. 1.
#> 3 1 2000 3 0.627 1. 1.
#> 4 1 2000 4 0.762 1. 1.
#> 5 1 2000 5 0.460 1. 1.
#> 6 1 2000 6 0.500 1. 1.
#> 7 1 2000 7 0.340 1. 1.
#> 8 1 2000 8 0.952 1. 1.
#> 9 1 2000 9 0.663 1. 1.
#> 10 1 2000 10 0.385 1. 1.
#> # ... with 1,824,990 more rows
tbl_out2[55:65, ]
#> # A tibble: 11 x 6
#> ID1 year day x month biweek
#> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 1 2000 55 0.127 2. 4.
#> 2 1 2000 56 0.779 2. 4.
#> 3 1 2000 57 0.625 2. 4.
#> 4 1 2000 58 0.245 2. 4.
#> 5 1 2000 59 0.640 2. 4.
#> 6 1 2000 60 0.423 3. 5.
#> 7 1 2000 61 0.439 3. 5.
#> 8 1 2000 62 0.105 3. 5.
#> 9 1 2000 63 0.218 3. 5.
#> 10 1 2000 64 0.668 3. 5.
#> 11 1 2000 65 0.589 3. 5.
Created on 2018-04-06 by the reprex package (v0.2.0).
You can speed this up almost an order of magnitude by defining date first, reducing redundancy in the date call, and then extracting month from date.
ptm <- proc.time()
df <- df %>% mutate(
  date = as.Date(paste0(year, "-", day), format = "%Y-%j"),       # create a vector of dates
  month = as.numeric(format(date, "%m")),                         # extract the month
  twowk = month * 2 - (as.numeric(format(date, "%d")) <= 15)) %>% # which biweek each day falls into
  dplyr::select(-date)
proc.time() - ptm
# user system elapsed
# 18.58 0.13 18.75
Versus original version in the question
# user system elapsed
# 117.67 0.15 118.45
Filtered for one year. I think it solves the leap-year issue you described, unless I've misunderstood. The last day of Feb is 59 in the df in my result below, but only because day is 0-indexed.
df2000 <- filter(df, year == "2000")
ptm <- proc.time()
df2000 <- df2000 %>% mutate(
  day = day - 1, # dates are 0 indexed
  date = as.Date(day, origin = "2000-01-01"),
  month = as.numeric(as.POSIXlt(date, format = "%Y-%m-%d")$mon + 1),
  bis = month * 2 - (as.numeric(format(date, "%d")) <= 15)
)
proc.time() - ptm
# user system elapsed
#  0.8    0.0     0.8
One year is 0.2 of the whole df, so times reflect that.
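Since the question fixes every year at 365 days, month and biweek can also be computed with plain arithmetic and no date objects at all. A sketch under that assumption, where month_start holds the day-of-year on which each non-leap month begins (matching the months list above):
month_start <- c(1, 32, 60, 91, 121, 152, 182, 213, 244, 274, 305, 335)
m <- findInterval(df$day, month_start) # month index for each day
dom <- df$day - month_start[m] + 1     # day of the month
df$month <- m
df$twowk <- m * 2 - (dom <= 15)        # 2m - 1 if dom <= 15, else 2m
Being fully vectorised, this avoids both the per-row sapply and the date formatting entirely.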

Creating a vector containing total quantities sold per delivery term

Have a look at the simplified table below. I want, for each product, a vector containing the quantities sold within each delivery term. A delivery term is defined as 4 days. So if we look at product A, we see that it starts at 03/12/15, and within the first delivery term (until 07/12/15) it has sold a quantity of 4. The second delivery term starts at 08/12/15 and ends at 12/12/15; for this period 1 unit is sold. The following delivery term starts at 13/12/15 and ends at 17/12/15; during this period no quantities are sold, and thus for this period the vector must have a value of 0. In the last period, finally, 2 products are sold. So basically the problem here is that information about the periods in which no products are sold is missing.
Any ideas on how the vector I want can be created using R? I've been thinking of for or while loops, but these do not seem to give the requested results. Note that the code must be applicable to a real dataset containing over 1000 product categories, so it has to be automated in some way.
I would be very grateful if somebody could point me in the right direction.
Product Quantity Date
A 1 03/12/15
A 2 04/12/15
A 1 05/12/15
A 1 08/12/15
A 1 17/12/16
A 1 18/12/16
B 1 19/12/15
B 2 10/05/15
B 2 11/05/15
C 1 01/06/15
C 1 02/06/15
C 1 12/06/15
Assume that dt is the dataset you provided. You'll get a better understanding of the process if you run it step by step (and maybe with an even simpler dataset).
library(lubridate)
library(dplyr)
# create date-time columns
dt$Date <- dmy(dt$Date)
dt %>%
  group_by(Product) %>%
  do(data.frame(days = seq(min(.$Date), max(.$Date), by = "1 day"))) %>% # all combinations of product and day
  mutate(dist = as.numeric(difftime(days, min(days), units = "days"))) %>% # distance of each day from the min date
  ungroup() %>%
  left_join(dt, by = c("Product" = "Product", "days" = "Date")) %>% # join to get the quantity for each day
  mutate(Quantity = ifelse(is.na(Quantity), 0, Quantity), # replace NAs with 0s
         id = floor(dist / 5 + 1)) %>% # create the delivery-term id (5-day windows, as in the example)
  group_by(Product, id) %>%
  summarise(Sum = sum(Quantity),
            min_date = min(days),
            max_date = max(days)) %>%
  ungroup
# Product id Sum min_date max_date
# 1 A 1 4 2015-12-03 2015-12-07
# 2 A 2 1 2015-12-08 2015-12-12
# 3 A 3 0 2015-12-13 2015-12-17
# 4 A 4 0 2015-12-18 2015-12-22
# 5 A 5 0 2015-12-23 2015-12-27
# 6 A 6 0 2015-12-28 2016-01-01
# 7 A 7 0 2016-01-02 2016-01-06
# 8 A 8 0 2016-01-07 2016-01-11
# 9 A 9 0 2016-01-12 2016-01-16
# 10 A 10 0 2016-01-17 2016-01-21
# .. ... .. ... ... ...
First row of the output tells you that for product A in the first 4 days period (id = 1) you had 4 quantities in total and the period is from 3/12 to 7/12.
I would suggest {dplyr}'s summarise(), mutate() and group_by() functions. group_by() groups your data by the desired variables (in your case, product and delivery term), mutate() allows operations on grouped columns, and summarise() applies a summarising function over these groups (in your case sum(Quantity)).
So this is how it will look:
Convert the date into the proper format:
library(dplyr)
df <- tbl_df(df)
df$Date <- as.Date(df$Date, format = "%d/%m/%y")
Calculate the delivery terms:
df <- group_by(df, Product) %>% arrange(Date)
df <- mutate(df, term = 1 + unclass(Date - min(Date)) %/% 4)
Group by product and term, and calculate the sum of quantity:
df <- group_by(df, Product, term)
summarise(df, sum = sum(Quantity))
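Note that this only yields terms in which something was sold. To also get the zero rows the asker wants, one hedged option is to pad the per-product summary with tidyr::complete(), which fills the gaps between the first and last observed term (full_seq() is from tidyr):
library(tidyr)
summarise(df, sum = sum(Quantity)) %>%
  complete(term = full_seq(term, 1), fill = list(sum = 0))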
Here's a base R way:
df$groups <- ave(as.numeric(df$Date), df$Product, FUN = function(x) {
  intrvl <- findInterval(x, seq(min(x), max(x), 4))
  as.numeric(factor(intrvl))
})
df
# Product Quantity Date groups
# 1 A 1 2015-12-03 1
# 2 A 2 2015-12-04 1
# 3 A 1 2015-12-05 1
# 4 A 1 2015-12-08 2
# 5 A 1 2016-12-17 3
# 6 A 1 2016-12-18 3
# 7 B 1 2015-12-19 2
# 8 B 2 2015-05-10 1
# 9 B 2 2015-05-11 1
# 10 C 1 2015-06-01 1
# 11 C 1 2015-06-02 1
# 12 C 1 2015-06-12 2
The dates should be converted to one of the date classes. I chose as.Date. When it converts to numeric, the output will be the number of days from a specified date. From there, we are able to group by 4 day increments.
Data
df$Date <- as.Date(df$Date, format="%d/%m/%y")
