New to R and learning lot and I love it.
My sales dataset has product, bpid, date. I would like to calculate the business difference between same bpid (just excluding Saturday & Sunday) .
If product or bpid changes (Or new bpid/product introduced), we need to skip the computations.
df <- data.frame(product=c('milk','milk','milk','milk','eggs','eggs','eggs','eggs'),
bpid=c(400,400,500,500,400,400,500,500),
date=c("2016-08-03","2016-08-10","2016-08-04","2016-08-10","2016-08-10","2016-08-16","2016-08-11","2016-08-15"));
df$date <- as.Date(df$date, format = "%Y-%m-%d");
My Desired result would be like below. Please help....
product bpid date compute-result
milk 400 2016-08-03 0
milk 400 2016-08-10 5
milk 500 2016-08-04 0
milk 500 2016-08-10 5
eggs 400 2016-08-10 0
eggs 400 2016-08-16 4
eggs 500 2016-08-11 0
eggs 500 2016-08-15 2
Real Data code (getting zeros at result column)
df <- data.frame(product=c('Keyt','Keyt','Keyt','Keyt','Keyt','Keyt'),
bpid=c(30057,30057,30057,30058,30058,30058),
date=c("2014-11-21","2015-05-05","2015-05-11","2014-10-16","2014-11-03","2016-03-15"));
df$date <- as.Date(df$date, format = "%Y-%m-%d");
cal <- Calendar(weekdays=c("saturday", "sunday"))
df$`compute-result` <- 0
idx <- seq(1, nrow(df),2)
df$`compute-result`[idx+1] <- bizdays(df$date[idx], df$date[idx+1], cal)
df
For example:
# install.packages("bizdays")
library(bizdays)
cal <- create.calendar(name = "mycal", weekdays=c("saturday", "sunday"))
df$`compute-result` <- 0
idx <- seq(1, nrow(df),2)
df$`compute-result`[idx+1] <- bizdays(df$date[idx], df$date[idx+1], cal)
df
# product bpid date compute-result
# 1 milk 400 2016-08-03 0
# 2 milk 400 2016-08-10 5
# 3 milk 500 2016-08-04 0
# 4 milk 500 2016-08-10 4
# 5 eggs 400 2016-08-10 0
# 6 eggs 400 2016-08-16 4
# 7 eggs 500 2016-08-11 0
# 8 eggs 500 2016-08-15 2
if you want to group by product and bpid, you could try
# install.packages("bizdays")
library(bizdays)
cal <- create.calendar(name="mycal", weekdays=c("saturday", "sunday"))
with(df, ave(as.integer(date), product, bpid, FUN=function(x) {
x <- as.Date(x, origin="1970-01-01")
c(0, bizdays(head(x, -1), tail(x, -1), cal))
})) -> df$result
df
# product bpid date result
# 1 Keyt 30057 2014-11-21 0
# 2 Keyt 30057 2015-05-05 117
# 3 Keyt 30057 2015-05-11 4
# 4 Keyt 30058 2014-10-16 0
# 5 Keyt 30058 2014-11-03 12
# 6 Keyt 30058 2016-03-15 356
Note that is converted date to integer and back to Date inside the function because otherwise ave throws an error:
Error in as.Date.numeric(value) : 'origin' must be supplied
and I dunno how to supply that origin argument here.
Related
I've this function to generate monthly ranges, it should consider years where february has 28 or 29 days:
starts ends
1 2017-01-01 2017-01-31
2 2017-02-01 2017-02-28
3 2017-03-01 2017-03-31
It works with:
make_date_ranges(as.Date("2017-01-01"), Sys.Date())
But gives error with:
make_date_ranges(as.Date("2017-01-01"), as.Date("2019-12-31"))
Why?
make_date_ranges(as.Date("2017-01-01"), as.Date("2019-12-31"))
Error in data.frame(starts, ends) :
arguments imply differing number of rows: 38, 36
add_months <- function(date, n){
seq(date, by = paste (n, "months"), length = 2)[2]
}
make_date_ranges <- function(start, end){
starts <- seq(from = start,
to = Sys.Date()-1 ,
by = "1 month")
ends <- c((seq(from = add_months(start, 1),
to = end,
by = "1 month" ))-1,
(Sys.Date()-1))
data.frame(starts,ends)
}
## useage
make_date_ranges(as.Date("2017-01-01"), as.Date("2019-12-31"))
1) First, define start of month, som, and end of month, eom functions which take a Date class object, date string in standard Date format or yearmon object and produce a Date class object giving the start or end of its year/months.
Using those, create a monthly Date series s using the start of each month from the month/year of from to that of to. Use pmax to ensure that the series does not extend before from and pmin so that it does not extend past to.
The input arguments can be strings in standard Date format, Date class objects or yearmon class objects. In the yearmon case it assumes the user wanted the full month for every month. (The if statement can be omitted if you don't need to support yearmon inputs.)
library(zoo)
som <- function(x) as.Date(as.yearmon(x))
eom <- function(x) as.Date(as.yearmon(x), frac = 1)
date_ranges2 <- function(from, to) {
if (inherits(to, "yearmon")) to <- eom(to)
s <- seq(som(from), eom(to), "month")
data.frame(from = pmax(as.Date(from), s), to = pmin(as.Date(to), eom(s)))
}
date_ranges2("2000-01-10", "2000-06-20")
## from to
## 1 2000-01-10 2000-01-31
## 2 2000-02-01 2000-02-29
## 3 2000-03-01 2000-03-31
## 4 2000-04-01 2000-04-30
## 5 2000-05-01 2000-05-31
## 6 2000-06-01 2000-06-20
date_ranges2(as.yearmon("2000-01"), as.yearmon("2000-06"))
## from to
## 1 2000-01-01 2000-01-31
## 2 2000-02-01 2000-02-29
## 3 2000-03-01 2000-03-31
## 4 2000-04-01 2000-04-30
## 5 2000-05-01 2000-05-31
## 6 2000-06-01 2000-06-30
2) This alternative takes the same approach but defines start of month (som) and end of month (eom) functions without using yearmon so that only base R is needed. It takes character strings in standard Date format or Date class inputs and gives the same output as (1).
som <- function(x) as.Date(cut(as.Date(x), "month")) # start of month
eom <- function(x) som(som(x) + 32) - 1 # end of month
date_ranges3 <- function(from, to) {
s <- seq(som(from), as.Date(to), "month")
data.frame(from = pmax(as.Date(from), s), to = pmin(as.Date(to), eom(s)))
}
date_ranges3("2000-01-10", "2000-06-20")
## from to
## 1 2000-01-10 2000-01-31
## 2 2000-02-01 2000-02-29
## 3 2000-03-01 2000-03-31
## 4 2000-04-01 2000-04-30
## 5 2000-05-01 2000-05-31
## 6 2000-06-01 2000-06-20
date_ranges3(som("2000-01-10"), eom("2000-06-20"))
## from to
## 1 2000-01-01 2000-01-31
## 2 2000-02-01 2000-02-29
## 3 2000-03-01 2000-03-31
## 4 2000-04-01 2000-04-30
## 5 2000-05-01 2000-05-31
## 6 2000-06-01 2000-06-30
You don't need to use seq twice -- you can subtract 1 day from the firsts of each month to get the ends, and generate one too many starts, then shift & subset:
make_date_ranges = function(start, end) {
# format(end, "%Y-%m-01") essentially truncates end to
# the first day of end's month; 32 days later is guaranteed to be
# in the subsequent month
starts = seq(from = start, to = as.Date(format(end, '%Y-%m-01')) + 32, by = 'month')
data.frame(starts = head(starts, -1L), ends = tail(starts - 1, -1L))
}
x = make_date_ranges(as.Date("2017-01-01"), as.Date("2019-12-31"))
rbind(head(x), tail(x))
# starts ends
# 1 2017-01-01 2017-01-31
# 2 2017-02-01 2017-02-28
# 3 2017-03-01 2017-03-31
# 4 2017-04-01 2017-04-30
# 5 2017-05-01 2017-05-31
# 6 2017-06-01 2017-06-30
# 31 2019-07-01 2019-07-31
# 32 2019-08-01 2019-08-31
# 33 2019-09-01 2019-09-30
# 34 2019-10-01 2019-10-31
# 35 2019-11-01 2019-11-30
# 36 2019-12-01 2019-12-31
I have a large dataset that includes date periods with different disease states per id and reference date. I would like to add a 'healthy' state for all missing date periods within +/- 5 years from the reference date per id.
I have tried to modify the solution here: Fill in missing date ranges but failed. Preferably, I would like to keep to the data.table framework. Any advice is greatly appreciated!
Sample data:
DT <- fread("
id reference_date period_start period_end Status
1 2010-01-10 2004-06-22 2005-03-15 1
1 2010-01-10 2008-10-11 2008-10-12 1
1 2010-01-10 2014-11-05 2016-01-03 2
2 2013-05-10 2012-02-01 2012-03-01 2
2 2014-06-11 2012-02-01 2012-03-01 2
3 2011-08-14 NA NA NA
")
Desired output:
DT <- fread("
id reference_date period_start period_end Status
1 2010-01-10 2004-06-22 2005-03-15 1
1 2010-01-10 2005-03-16 2008-10-10 0
1 2010-01-10 2008-10-11 2008-10-12 1
1 2010-01-10 2008-10-13 2014-11-04 0
1 2010-01-10 2014-11-05 2016-01-03 2
2 2013-05-10 2008-05-10 2012-01-31 0
2 2013-05-10 2012-02-01 2012-03-01 2
2 2013-05-10 2012-03-02 2018-05-10 0
2 2014-06-11 2009-06-11 2012-01-31 0
2 2014-06-11 2012-02-01 2012-03-01 2
2 2014-06-11 2012-03-02 2019-06-11 0
3 2011-08-14 2006-08-14 2016-08-14 0
")
Comment:
For the first row, the +/-5 year date interval is from 2005-01-10 to 2015-01-10. However, because of the ongoing disease state that ends 2005-03-15, the "healthy" period starts at 2005-03-16. Because there can be several reference dates per id, duplicate date periods (as observed for id 2: 2012-02-01-2012-03-01) will be present and are OK. Finally, ids with no disease states are represented by NA (as id 3).
EDIT: I had some problems with the real data, so I tweaked the solution a bit; also added so that the status is collapsed per date interval:
DT2 <- DT[,{
# +/-5 years from t0
sdt <- seq(reference_date, by="-5 years", length.out=2L)[2L]
edt <- seq(reference_date, by="5 years", length.out=2L)[2L]
if(is.na(start[1L])) {
# replace NA with full time interval for 'healthy'
.(period_start=sdt, period_end=edt, status='notsick')
} else{
# Add date for -5 years if it is the minimum, otherwise use existing minimum
if (sdt < period_start[1L]) {
period_start <- c(sdt, period_start)
}
# Add date for +5 years if it is the maximum, otherwise use existing maximum
if (edt > period_end[.N]) {
period_end <- c(period_end,edt)
}
dates=unique(sort(c(period_start, period_end+1L)))
.(start=dates[-length(dates)],end=dates[-1L]-1,status='')
}
},
.(id,reference_date)]
## (c). Collapse status for overlapping periods
DT <- DT[DT2, on = .(id,reference_date, period_start <= period_start, period_end >= period_end), {
status <- paste(status, collapse = ";")
.(status=status)},
by = .EACHI, allow.cartesian = TRUE]
here is an option:
interweave <- function(x, y) c(rbind(x, y)) #see ref
ans <- DT[, {
sdt <- seq(reference_date, by="-5 years", length.out=2L)[2L]
edt <- seq(reference_date, by="5 years", length.out=2L)[2L]
if(is.na(period_start[1L])) {
.(period_start=sdt, period_end=edt, Status=0L)
} else {
if (sdt < period_start[1L]) {
period_start <- c(sdt, period_start)
}
ps <- as.IDate(sort(interweave(period_start, period_end+1L)))
if (period_end[.N] > edt) {
ps <- ps[-length(ps)]
pe <- period_end[.N]
} else {
pe <- edt
}
.(period_start=ps, period_end=c(ps[-1L] - 1, pe), Status=0L)
}
},
.(id, reference_date)]
ans[DT, on=setdiff(names(DT), "Status"), Status := i.Status]
ans
data:
library(data.table)
DT <- fread("
id reference_date period_start period_end Status
1 2010-01-10 2004-06-22 2005-03-15 1
1 2010-01-10 2008-10-11 2008-10-12 1
1 2010-01-10 2014-11-05 2016-01-03 2
2 2013-05-10 2012-02-01 2012-03-01 2
2 2014-06-11 2012-02-01 2012-03-01 2
3 2011-08-14 NA NA NA
")
cols <- c("reference_date","period_start","period_end")
DT[, (cols) := lapply(.SD, as.IDate, format="%Y-%m-%d"), .SDcols=cols]
Reference:
Alternate, interweave or interlace two vectors
I am trying to get a count of active clients per month, using data that has a start and end date to each client's episode. The code I am using I can't work out how to count per month, rather than per every n days.
Here is some sample data:
Start.Date <- as.Date(c("2014-01-01", "2014-01-02","2014-01-03","2014-01-03"))
End.Date<- as.Date(c("2014-01-04", "2014-01-03","2014-01-03","2014-01-04"))
Make sure the dates are dates:
Start.Date <- as.Date(Start.Date, "%d/%m/%Y")
End.Date <- as.Date(End.Date, "%d/%m/%Y")
Here is the code I am using, which current counts the number per day:
library(plyr)
count(Reduce(c, Map(seq, start.month, end.month, by = 1)))
which returns:
x freq
1 2014-01-01 1
2 2014-01-02 2
3 2014-01-03 4
4 2014-01-04 2
The "by" argument can be changed to be however many days I want, but problems arise because months have different lengths.
Would anyone be able to suggest how I can count per month?
Thanks a lot.
note: I now realize that for my example data I have only used dates in the same month, but my real data has dates spanning 3 years.
Here's a solution that seems to work. First, I set the seed so that the example is reproducible.
# Set seed for reproducible example
set.seed(33550336)
Next, I create a dummy data frame.
# Test data
df <- data.frame(Start_date = as.Date(sample(seq(as.Date('2014/01/01'), as.Date('2015/01/01'), by="day"), 12))) %>%
mutate(End_date = as.Date(Start_date + sample(1:365, 12, replace = TRUE)))
which looks like,
# Start_date End_date
# 1 2014-11-13 2015-09-26
# 2 2014-05-09 2014-06-16
# 3 2014-07-11 2014-08-16
# 4 2014-01-25 2014-04-23
# 5 2014-05-16 2014-12-19
# 6 2014-11-29 2015-07-11
# 7 2014-09-21 2015-03-30
# 8 2014-09-15 2015-01-03
# 9 2014-09-17 2014-09-26
# 10 2014-12-03 2015-05-08
# 11 2014-08-03 2015-01-12
# 12 2014-01-16 2014-12-12
The function below takes a start date and end date and creates a sequence of months between these dates.
# Sequence of months
mon_seq <- function(start, end){
# Change each day to the first to aid month counting
day(start) <- 1
day(end) <- 1
# Create a sequence of months
seq(start, end, by = "month")
}
Right, this is the tricky bit. I apply my function mon_seq to all rows in the data frame using mapply. This gives the months between each start and end date. Then, I combine all these months together into a vector. I format this vector so that dates just contain months and years. Finally, I pipe (using dplyr's %>%) this into table which counts each occurrence of year-month and I cast as a data frame.
data.frame(format(do.call("c", mapply(mon_seq, df$Start_date, df$End_date)), "%Y-%m") %>% table)
This gives,
# . Freq
# 1 2014-01 2
# 2 2014-02 2
# 3 2014-03 2
# 4 2014-04 2
# 5 2014-05 3
# 6 2014-06 3
# 7 2014-07 3
# 8 2014-08 4
# 9 2014-09 6
# 10 2014-10 5
# 11 2014-11 7
# 12 2014-12 8
# 13 2015-01 6
# 14 2015-02 4
# 15 2015-03 4
# 16 2015-04 3
# 17 2015-05 3
# 18 2015-06 2
# 19 2015-07 2
# 20 2015-08 1
# 21 2015-09 1
I have some data. ID and date and I'm trying to create a new field for semester.
df:
id date
1 20160822
2 20170109
3 20170828
4 20170925
5 20180108
6 20180402
7 20160711
8 20150831
9 20160111
10 20160502
11 20160829
12 20170109
13 20170501
I also have a semester table:
start end season_year
20120801 20121222 Fall-2012
20121223 20130123 Winter-2013
20130124 20130523 Spring-2013
20130524 20130805 Summer-2013
20130806 20131228 Fall-2013
20131229 20140122 Winter-2014
20140123 20140522 Spring-2014
20140523 20140804 Summer-2014
20140805 20141227 Fall-2014
20141228 20150128 Winter-2015
20150129 20150528 Spring-2015
20150529 20150803 Summer-2015
20150804 20151226 Fall-2015
20151227 20160127 Winter-2016
20160128 20160526 Spring-2016
20160527 20160801 Summer-2016
20160802 20161224 Fall-2016
20161225 20170125 Winter-2017
20170126 20170525 Spring-2017
20170526 20170807 Summer-2017
20170808 20171230 Fall-2017
20171231 20180124 Winter-2018
20180125 20180524 Spring-2018
20180525 20180806 Summer-2018
20180807 20181222 Fall-2018
20181223 20190123 Winter-2019
20190124 20190523 Spring-2019
20190524 20180804 Summer-2019
I'd like to create a new field in df if df$date is between semester$start and semester$end, then place the respective value semester$season_year in df
I tried to see if the lubridate package could help but that seems to be more for calculations
I saw this question and it seems to be the closest to what i want, but, to make things more complicated, not all of our semesters are six months
Does this work?
library(lubridate)
semester$start <- ymd(semester$start)
semester$end <- ymd(semester$end)
df$date <- ymd(df$date)
LU <- Map(`:`, semester$start, semester$end)
LU <- data.frame(value = unlist(LU),
index = rep(seq_along(LU), lapply(LU, length)))
df$semester <- semester$season_year[LU$index[match(df$date, LU$value)]]
A solution using non-equi update joins using data.table and lubridate package can be as:
library(data.table)
setDT(df)
setDT(semester)
df[,date:=as.IDate(as.character(date), format = "%Y%m%d")]
semester[,':='(start = as.IDate(as.character(start), format = "%Y%m%d"),
end=as.IDate(as.character(end), format = "%Y%m%d"))]
df[semester, on=.(date >= start, date <= end), season_year := i.season_year]
df
# id date season_year
# 1: 1 2016-08-22 Fall-2016
# 2: 2 2017-01-09 Winter-2017
# 3: 3 2017-08-28 Fall-2017
# 4: 4 2017-09-25 Fall-2017
# 5: 5 2018-01-08 Winter-2018
# 6: 6 2018-04-02 Spring-2018
# 7: 7 2016-07-11 Summer-2016
# 8: 8 2015-08-31 Fall-2015
# 9: 9 2016-01-11 Winter-2016
# 10: 10 2016-05-02 Spring-2016
# 11: 11 2016-08-29 Fall-2016
# 12: 12 2017-01-09 Winter-2017
# 13: 13 2017-05-01 Spring-2017
Data:
df <- read.table(text="
id date
1 20160822
2 20170109
3 20170828
4 20170925
5 20180108
6 20180402
7 20160711
8 20150831
9 20160111
10 20160502
11 20160829
12 20170109
13 20170501",
header = TRUE, stringsAsFactors = FALSE)
semester <- read.table(text="
start end season_year
20120801 20121222 Fall-2012
20121223 20130123 Winter-2013
20130124 20130523 Spring-2013
20130524 20130805 Summer-2013
20130806 20131228 Fall-2013
20131229 20140122 Winter-2014
20140123 20140522 Spring-2014
20140523 20140804 Summer-2014
20140805 20141227 Fall-2014
20141228 20150128 Winter-2015
20150129 20150528 Spring-2015
20150529 20150803 Summer-2015
20150804 20151226 Fall-2015
20151227 20160127 Winter-2016
20160128 20160526 Spring-2016
20160527 20160801 Summer-2016
20160802 20161224 Fall-2016
20161225 20170125 Winter-2017
20170126 20170525 Spring-2017
20170526 20170807 Summer-2017
20170808 20171230 Fall-2017
20171231 20180124 Winter-2018
20180125 20180524 Spring-2018
20180525 20180806 Summer-2018
20180807 20181222 Fall-2018
20181223 20190123 Winter-2019
20190124 20190523 Spring-2019
20190524 20180804 Summer-2019",
header = TRUE, stringsAsFactors = FALSE)
I have daily data for 1 year having 0 and 1 values. I want to calculate monthly events, there is consecutive 1 value for 3 on more days using R?
set.seed(123)
abts1 <- sample(0:1, 366, replace=TRUE)
library(xts)
d16 <- seq(as.Date("2016-01-01"), as.Date("2016-12-31"), 1)
ax16 <- as.Date(d16,"%y-%m-%d")
abts12 <- xts(abts1, ax16)
# but it gives events for complete period, not as monthly.
apply.monthly(abts12, function(x) sum(with(rle(c(x!=0)), lengths*values)>=3))
The last line of your code throws an error for me when I use xts_0.9-7.
R> apply.monthly(abts12, function(x) sum(with(rle(x!=0), lengths*values)>=3))
Error in rle(x != 0) : 'x' must be a vector of an atomic type
That's easy to fix though. You just need to wrap x != 0 in as.logical.
R> apply.monthly(abts12, function(x) sum(with(rle(as.logical(x!=0)), lengths*values)>=3))
[,1]
2016-01-31 2
2016-02-29 1
2016-03-31 3
2016-04-30 2
2016-05-31 1
2016-06-30 2
2016-07-31 3
2016-08-31 3
2016-09-30 2
2016-10-31 3
2016-11-30 0
2016-12-31 2
That seems like the output you expect. The number of times there are 3 or more consecutive days with a value of 1.