R: check if any dates are within 2 years of each other

I have a dataset with two columns, Id and Date, shown below with a toy example.
Id Date
5373283 2010-11-05
5373283 2014-11-05
5373283 2001-07-13
5373283 2007-12-01
5373283 2015-07-07
3475684 2015-05-19
3475684 2010-06-24
I want to check, for each Id, whether any of its dates are within 2 years of each other. If they are, a Status column should show Yes; if not, No. The final output would look like this:
Id Status
5373283 Yes
3475684 No
Yes for Id 5373283 because the two dates 2014-11-05 and 2015-07-07 are within two years of each other. No for Id 3475684 because its two dates are more than 2 years apart. Any help on accomplishing this is much appreciated.

Hypothetical data.
DF <- data.frame(id = c(1, 1, 1, 2, 2),
date = c("2010-10-9", "2012-10-8", "2008-10-5",
"2007-7-5", "2009-7-5"), stringsAsFactors = FALSE)
The code below gets the minimal interval by ID, in days.
What is happening is:
mutate converts date to the Date class,
arrange sorts the data by date,
group_by says that the following computation is done for each ID,
summarize computes the minimum difference.
library(dplyr)
DF %>% mutate(date = as.Date(date)) %>%
  arrange(date) %>%
  group_by(id) %>%
  summarize(diffmin = as.numeric(min(diff(date)), units = "days"))
# id diffmin
# (dbl) (dbl)
#1 1 730
#2 2 731
If you can ignore leap years, a value smaller than or equal to 730 means the dates are within 2 years. Note that the difference between 2007-7-5 and 2009-7-5 is 731 days, so that pair is judged as more than 2 years apart.
If that is not acceptable, a simple difference in days is not enough and you need to define a custom checker function.
check2years <- function(a, b) {
  # check if b - a <= 2 years
  # assumes a and b are Date, with a <= b
  yr_a <- format(a, "%Y") %>% as.integer()
  yr_b <- format(b, "%Y") %>% as.integer()
  dy_a <- format(a, "%m-%d")
  dy_b <- format(b, "%m-%d")
  # "%m-%d" strings compare correctly in lexicographic order
  (yr_b - yr_a < 2) | ((yr_b - yr_a == 2) & (dy_b >= dy_a))
}
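For example, the exact two-year pair discussed above now counts as within 2 years:
check2years(as.Date("2007-07-05"), as.Date("2009-07-05"))
#[1] TRUE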
Then you can check whether any pair of adjacent dates (which is enough, because the data is sorted) is within 2 years:
DF %>% mutate(date = as.Date(date)) %>%
  arrange(date) %>%
  group_by(id) %>%
  summarize(within2yr = any(check2years(head(date, length(date) - 1),
                                        tail(date, length(date) - 1))))
# id within2yr
# (dbl) (lgl)
#1 1 TRUE
#2 2 TRUE
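To produce the Yes/No Status column asked for in the question, the same pipeline can be extended with one more mutate() (a small sketch using the objects defined above):
DF %>% mutate(date = as.Date(date)) %>%
  arrange(date) %>%
  group_by(id) %>%
  summarize(within2yr = any(check2years(head(date, length(date) - 1),
                                        tail(date, length(date) - 1)))) %>%
  mutate(Status = ifelse(within2yr, "Yes", "No"))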

You can also solve this without any library, using your example:
Id = c(5373283,5373283,5373283,5373283,5373283,3475684,3475684)
Date = as.Date(c("2010-11-05","2014-11-05","2001-07-13","2007-12-01","2015-07-07","2015-05-19","2010-06-24"))
df = data.frame(Id,Date)
> df
Id Date
7 3475684 2010-06-24
6 3475684 2015-05-19
3 5373283 2001-07-13
4 5373283 2007-12-01
1 5373283 2010-11-05
2 5373283 2014-11-05
5 5373283 2015-07-07
Do the following:
First, order your data by Id and then by Date:
df = df[order(df$Id,df$Date),]
Then aggregate by Id using the function min(diff(x)), where x is the vector of dates for each Id:
z = aggregate(df$Date,by = list(Id = df$Id),FUN = function(x){min(diff(x))})
This returns the smallest difference between adjacent dates, which is why you need to order the data frame first.
This returns:
> z
Id x
1 3475684 1790
2 5373283 244
Where column x is the minimum difference in days.
Here you only need to evaluate whether column x is less than or equal to 2*365:
z$result = z$x<=2*365
Giving:
Id x result
1 3475684 1790 FALSE
2 5373283 244 TRUE
Final code
df = df[order(df$Id,df$Date),]
z = aggregate(df$Date,by = list(Id = df$Id),FUN = function(x){min(diff(x))})
z$result = z$x<=2*365
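If you want the literal Yes/No Status column from the question, one more line (not part of the original answer) does it:
z$Status = ifelse(z$result, "Yes", "No")
z[, c("Id", "Status")]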

You can use something like this with the dplyr library, taking the two most recent dates for each Id and checking whether they differ by less than two years (note that this compares only the latest pair of dates for each Id):
library(dplyr)
df$Date <- as.Date(df$Date)
df %>%
  group_by(Id) %>%
  summarise(Status = as.numeric(difftime(max(Date), Date[order(Date, decreasing = TRUE)][2], units = 'days')) < 730)
Output will be as follows:
Source: local data frame [2 x 2]
Id Status
(int) (lgl)
1 3475684 FALSE
2 5373283 TRUE

Related

Calculate number of pending tasks at given time points (ideally with dplyr)

I have a database containing a list of events. Each event has an associated start date, and a date when the event ended or was completed, eg:
dataset <- tibble(
eventid = sample(1:100, 25, replace=TRUE),
start_date = sample(seq(as.Date('2011/01/01'), as.Date('2012/01/01'), by="day"), 25),
completed_date = sample(seq(as.Date('2012/01/01'), as.Date('2014/01/01'), by="day"), 25)
)
> dataset
# A tibble: 25 x 3
eventid start_date completed_date
<int> <date> <date>
1 57 2011-01-14 2013-01-07
2 97 2011-01-21 2011-03-03
3 58 2011-01-26 2011-02-05
4 25 2011-03-22 2013-07-20
5 8 2011-04-20 2012-07-16
6 81 2011-04-26 2013-03-04
7 42 2011-05-02 2012-01-16
8 77 2011-05-03 2012-08-14
9 78 2011-05-21 2013-09-26
10 49 2011-05-22 2013-01-04
# ... with 15 more rows
I am trying to produce a rolling "snapshot" of how many tasks were pending at different points in time, e.g. month by month. Expected result:
# A tibble: 25 x 2
month count
<date> <int>
1 2011-01-01 0
2 2011-02-01 3
3 2011-03-01 2
4 2011-04-01 2
5 2011-05-01 4
6 2011-06-01 8
I have attempted to group my variables using group_by(period=floor_date(start_date,"month")), but I'm a bit stuck and would appreciate a pointer in the right direction!
I would prefer a solution using dplyr if possible.
Thanks!
You can expand rows for each month included in the range of dates with map2 from purrr. map2 will iterate over multiple inputs simultaneously. In this case, it will iterate through the start and end dates at the same time.
In each iteration, it will create a monthly sequence using seq (or seq.Date) from the start month to the end month (determined with floor_date). The result is nested for each row of data (since one row can have multiple months in its sequence), so unnest is needed afterwards.
The transmute will add a new variable called month_year (and drop the old ones) and use substr to extract the year and month only (no day); these are the first through seventh characters of the date.
Then, you can group_by the month-year and count up the number of pending projects for each month_year.
I included set.seed to reproduce from data below.
library(dplyr)
library(tidyr)
library(purrr)
library(lubridate)
dataset %>%
  mutate(month = map2(floor_date(start_date, "month"),
                      floor_date(completed_date, "month"),
                      seq.Date,
                      by = "month")) %>%
  unnest(month) %>%
  transmute(month_year = substr(month, 1, 7)) %>%
  group_by(month_year) %>%
  summarise(count = n())
Output
month_year count
<chr> <int>
1 2011-01 1
2 2011-02 3
3 2011-03 9
4 2011-04 10
5 2011-05 13
6 2011-06 15
7 2011-07 16
8 2011-08 18
9 2011-09 19
10 2011-10 20
# … with 22 more rows
If you want to exclude the completed month (except when the start and completed months are the same, if that can occur), you can move the end of each sequence back by one day, so that it falls in the previous month. pmax is used so that when the start and completed months are the same, that month is still counted.
Here is the modified mutate with map2:
mutate(month = map2(floor_date(start_date, "month"),
                    pmax(floor_date(completed_date, "month") - 1, floor_date(start_date, "month")),
                    seq.Date,
                    by = "month"))
Data
set.seed(123)
dataset <- tibble(
eventid = sample(1:100, 25, replace=TRUE),
start_date = sample(seq(as.Date('2011/01/01'), as.Date('2012/01/01'), by="day"), 25),
completed_date = sample(seq(as.Date('2012/01/01'), as.Date('2014/01/01'), by="day"), 25)
)
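As an aside (not part of the answer above): if what you want is exactly the point-in-time snapshot from the expected output, i.e. the number of events still pending on the first day of each month, a direct base-R style sketch is:
library(lubridate)

month_starts <- seq(floor_date(min(dataset$start_date), "month"),
                    floor_date(max(dataset$completed_date), "month"),
                    by = "month")

data.frame(
  month = month_starts,
  count = sapply(seq_along(month_starts), function(i)
    # pending = started before the snapshot date and not yet completed
    sum(dataset$start_date < month_starts[i] &
        dataset$completed_date >= month_starts[i]))
)
Adjust the inequalities if your definition of "pending" differs.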

Counting the number of times an observation exists in a dataset using R? (with multiple criteria)

So I have this dataset of about 2,800 observations. The headers look a little something like this:
ItemName ItemNumber PromotedDate
ItemA 14321 12/31/2018
ItemB 14335 11/18/2018
ItemC 14542 10/05/2018
I want to be able to add a new column to this dataset, Number.Times.Promoted.Last.3.Months, that counts how many times each item appears in the dataset in the three months before that row's PromotedDate.
I've tried creating some code (below) but it returns 0 for every row. When I just try it with the item number, I get the number of observations in the entire dataset.
df$Number.Times.Promoted.Last.Three.Months <- sum(df$ItemNumber == df$ItemNumber &
                                                  df$PromotedDate < df$PromotedDate &
                                                  df$PromotedDate > (as.Date(df$PromotedDate - 100)),
                                                  na.rm = TRUE)
I'd love for the code to return the actual number of times each item in the dataset was promoted in the 3 months before that row's PromotedDate, and for that to be attached to each row of the data (df). Would love some help in figuring out what I'm doing wrong. Thanks!
Note: In the file linked to there is a typo, the first ItemB starts with a lower case i. The code below works even if this is not corrected.
I find the following solution a bit too complicated but it does what the question asks for.
library(lubridate)
fun <- function(x){
  # 3 months before Dec 31 would be the non-existent Sep 31, which lubridate
  # turns into NA, so that date is shifted back 92 days instead
  ifelse(month(x) == 12 & day(x) == 31,
         x - days(31 + 30 + 31),
         x - months(3)
  )
}
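For example (note that ifelse() drops the Date class, which is why the result is wrapped in as.Date(..., origin = "1970-01-01") further down); the commented values are what I would expect:
as.Date(fun(as.Date("2018-12-31")), origin = "1970-01-01")
#[1] "2018-09-30"
as.Date(fun(as.Date("2018-11-18")), origin = "1970-01-01")
#[1] "2018-08-18"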
df <- readxl::read_xlsx("example_20190519.xlsx")
df$PromotedDate <- as.Date(df$PromotedDate)
sp <- split(df, tolower(df$ItemName))
res <- lapply(sp, function(DF){
tmp <- as.Date(fun(DF$PromotedDate), origin = "1970-01-01")
sapply(seq_len(nrow(DF)), function(i){
sum(DF$PromotedDate[i] > DF$PromotedDate & DF$PromotedDate > tmp[i])
})
})
df$New.3.Months <- NA
for(nm in names(res)) {
df$New.3.Months[tolower(df$ItemName) == nm] <- res[[nm]]
}
Now test to see if the result is the same as in the example .xlsx file.
all.equal(df$Times.Promoted.Last.3.Months, df$New.3.Months)
#[1] TRUE
And final cleanup.
rm(sp)
Here's an arguably simpler solution that relies on dplyr and fuzzyjoin.
First I define a date 90 days earlier**, and then join the table with itself, pulling in, for each row, every match on Item with a promotion date that is both "since 90 days before" and "up to the current date." The number of rows for each Item-Date is then the number of promotions within 90 days; subtracting the row representing itself gives the number of prior promotions.
** "90 days earlier" is simpler than "3mo earlier," which varies in length and is arguable for some dates: what's 3 months before May 30?
Prep
library(dplyr); library(fuzzyjoin); library(lubridate)
df <- readxl::read_excel(
"~/Downloads/example_20190519.xlsx",
col_types = c("text", "numeric", "date", "numeric"))
df_clean <- df %>% select(-Times.Promoted.Last.3.Months)
Solution
df_clean %>%
  mutate(PromotedDate_less90 = PromotedDate - days(90)) %>%
  # Pull in all matches (including current row) with matching Item and Promoted Date
  # that is between Promoted Date and 90 days prior.
  fuzzy_left_join(df_clean,
                  by = c("ItemName" = "ItemName",
                         "ItemNumber" = "ItemNumber",
                         "PromotedDate_less90" = "PromotedDate",
                         "PromotedDate" = "PromotedDate"),
                  match_fun = list(`==`, `==`, `<=`, `>=`)
  ) %>%
  group_by(ItemName = ItemName.x,
           ItemNumber = ItemNumber.x,
           PromotedDate = PromotedDate.x) %>%
  summarize(promotions_in_prior_90d = n() - 1) %>%
  ungroup()
Output (in different order, but matching goal)
# A tibble: 12 x 4
ItemName ItemNumber PromotedDate promotions_in_prior_90d
<chr> <dbl> <dttm> <dbl>
1 ItemA 10021 2018-09-19 00:00:00 0
2 ItemA 10021 2018-10-15 00:00:00 1
3 ItemA 10021 2018-11-30 00:00:00 2
4 ItemA 10021 2018-12-31 00:00:00 2
5 itemB 10024 2018-12-15 00:00:00 0
6 ItemB 10024 2018-04-02 00:00:00 0
7 ItemB 10024 2018-06-05 00:00:00 1
8 ItemB 10024 2018-12-01 00:00:00 0
9 ItemC 19542 2018-07-20 00:00:00 0
10 ItemC 19542 2018-11-17 00:00:00 0
11 ItemC 19542 2018-12-01 00:00:00 1
12 ItemC 19542 2018-12-14 00:00:00 2

Assigning total to correct month from date range

I've got a data set with reservation data that has the below format :
property <- c('casa1', 'casa2', 'casa3')
check_in <- as.Date(c('2018-01-01', '2018-01-30','2018-02-28'))
check_out <- as.Date(c('2018-01-02', '2018-02-03', '2018-03-02'))
total_paid <- c(100,110,120)
df <- data.frame(property,check_in,check_out, total_paid)
My goal is to divide each total_paid amount over the days of the stay and assign it to the correct month, for budgeting purposes.
While there's no issue for casa1, the stays for casa2 and casa3 span two different months, and the totals get skewed because of this.
Any help much appreciated!
Here you go:
library(dplyr)
library(tidyr)
df %>%
  mutate(id = seq_along(property),  # make a few helper variables
         day_paid = total_paid / as.numeric(check_out - check_in),
         date = check_in) %>%
  group_by(id) %>%
  complete(date = seq.Date(check_in, (check_out - 1), by = "day")) %>% # get a date for each day of stay (except the last)
  ungroup() %>%                                   # now one row per day of stay
  mutate(month = cut(date, breaks = "month")) %>% # determine the month of each date
  fill(property, check_in, check_out, total_paid, day_paid) %>%
  group_by(id, month) %>%
  summarise(property = unique(property),
            check_in = unique(check_in),
            check_out = unique(check_out),
            total_paid = unique(total_paid),
            paid_month = sum(day_paid))           # summarise per month
result:
# A tibble: 5 x 7
# Groups: id [3]
id month property check_in check_out total_paid paid_month
<int> <fct> <fct> <date> <date> <dbl> <dbl>
1 1 2018-01-01 casa1 2018-01-01 2018-01-02 100 100
2 2 2018-01-01 casa2 2018-01-30 2018-02-03 110 55
3 2 2018-02-01 casa2 2018-01-30 2018-02-03 110 55
4 3 2018-02-01 casa3 2018-02-28 2018-03-02 120 60
5 3 2018-03-01 casa3 2018-02-28 2018-03-02 120 60
I hope it's somewhat readable, but please ask if there is something I should explain. The convention is that people don't pay for the last day of a stay, so I took that into account.
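As a quick sanity check (an illustration; this assumes the result of the pipeline above has been stored in a variable called monthly), the per-month amounts should add back up to total_paid for each property:
monthly %>%
  group_by(property) %>%
  summarise(paid = sum(paid_month), total = unique(total_paid))
# paid should equal total for each property here (100, 110 and 120)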

Melting Data by Date Range

I'm running into a data issue in R regarding properly melting data. It is currently in the following form:
Campaign, ID, Start Date, End Date, Total Number of Days, Total Spend, Total Impressions, Total Conversions
I would like my data to look like the following:
Campaign, ID, Date, Spend, Impressions, Conversions
Each 'date' should contain a specific day the campaign was run while spend, impressions, and conversions should equal Total Spend / Total # of Days, Total Impressions / Total # of Days, and Total Conversions / Total # of Days, respectively.
I'm working in RStudio so a solution in R is needed. Does anyone have experience manipulating data like this?
This works, but it's not particularly efficient. If your data is millions of rows or more, I've had better luck using SQL and inequality joins.
library(tidyverse)
#create some bogus data
data <- data.frame(ID = 1:10,
StartDate = sample(seq.Date(as.Date("2018-01-01"), as.Date("2018-12-31"), "day"), 10),
Total = runif(10)) %>%
mutate(EndDate = StartDate + floor(runif(10) * 14))
#generate all dates between the min and max in the dataset
AllDates = data.frame(Date = seq.Date(min(data$StartDate), max(data$EndDate), "day"),
Dummy = TRUE)
#join via a dummy variable to add rows for all dates to every ID
data %>%
mutate(Dummy = TRUE) %>%
inner_join(AllDates, by = c("Dummy" = "Dummy")) %>%
#filter to just the dates between the start and end
filter(Date >= StartDate, Date <= EndDate) %>%
#divide the total by the number of days
group_by(ID) %>%
mutate(TotalPerDay = Total / n()) %>%
select(ID, Date, TotalPerDay)
# A tibble: 91 x 3
# Groups: ID [10]
ID Date TotalPerDay
<int> <date> <dbl>
1 1 2018-06-21 0.00863
2 1 2018-06-22 0.00863
3 1 2018-06-23 0.00863
4 1 2018-06-24 0.00863
5 1 2018-06-25 0.00863
6 1 2018-06-26 0.00863
7 1 2018-06-27 0.00863
8 1 2018-06-28 0.00863
9 1 2018-06-29 0.00863
10 1 2018-06-30 0.00863
# ... with 81 more rows
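The question actually has three totals to spread (Total Spend, Total Impressions, Total Conversions). Here is a sketch of the same idea with hypothetical column names (Campaign, ID, StartDate, EndDate, TotalSpend, TotalImpressions, TotalConversions; adjust these to your real names), assuming the tidyverse is loaded as above:
# campaigns is your data frame with the columns named above
AllDates <- data.frame(Date = seq.Date(min(campaigns$StartDate), max(campaigns$EndDate), "day"),
                       Dummy = TRUE)
campaigns %>%
  mutate(Dummy = TRUE) %>%
  inner_join(AllDates, by = c("Dummy" = "Dummy")) %>%
  filter(Date >= StartDate, Date <= EndDate) %>%
  group_by(Campaign, ID) %>%
  mutate(Spend = TotalSpend / n(),          # n() is the number of days in the campaign
         Impressions = TotalImpressions / n(),
         Conversions = TotalConversions / n()) %>%
  ungroup() %>%
  select(Campaign, ID, Date, Spend, Impressions, Conversions)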

Convert dplyr chain into a function

Given a column of dates, this will count the number of records in each month
library(dplyr)
library(lubridate)
samp <- tbl_df(seq.Date(as.Date("2017-01-01"), as.Date("2017-12-01"), by="day"))
freq <- samp %>%
filter(!is.na(value)) %>%
transmute(month = floor_date(value, "month")) %>%
group_by(month) %>% summarise(adds = n())
freq
# A tibble: 12 x 2
month adds
<date> <int>
1 2017-01-01 31
2 2017-02-01 28
3 2017-03-01 31
4 2017-04-01 30
5 2017-05-01 31
6 2017-06-01 30
7 2017-07-01 31
8 2017-08-01 31
9 2017-09-01 30
10 2017-10-01 31
11 2017-11-01 30
12 2017-12-01 1
I would like to convert this to a function, so that I can perform the operation on a number of variables. I have read the vignette on dplyr programming but continue to have issues.
My attempt;
library(rlang)
count_x_month <- function(df, var, name){
  var <- enquo(var)
  name <- enquo(name)
  df %>%
    filter(!is.na(!!var)) %>%
    transmute(month := floor_date(!!var, "month")) %>%
    group_by(month) %>% summarise(!!name := n())
}
freq2 <- samp %>% count_x_month(value, out)
Error message:
Error: invalid argument type
Making this version of the function work will be a big help. More broadly, other ways to achieve the objective would be welcome.
One way to state the problem: given a dataframe of customers and first purchase dates, count the number of customers purchasing for the first time in each month.
Update: the selected answer works in dplyr 0.7.4, but the RStudio environment I have access to has dplyr 0.5.0. What modifications are required to 'backport' this function?
You forgot to quo_name it
library(rlang)
count_x_month <- function(df, var, name){
  var <- enquo(var)
  name <- enquo(name)
  name <- quo_name(name)
  df %>%
    filter(!is.na(!!var)) %>%
    transmute(month := floor_date(!!var, "month")) %>%
    group_by(month) %>%
    summarise(!!name := n())
}
freq2 <- samp %>% count_x_month(value, out)
# A tibble: 12 x 2
month out
<date> <int>
1 2017-01-01 31
2 2017-02-01 28
3 2017-03-01 31
4 2017-04-01 30
5 2017-05-01 31
6 2017-06-01 30
7 2017-07-01 31
8 2017-08-01 31
9 2017-09-01 30
10 2017-10-01 31
11 2017-11-01 30
12 2017-12-01 1
See "Different input and output variable" section of "Programming with dplyr":
We create the new names by pasting together strings, so we need
quo_name() to convert the input expression to a string.
The error is caused by summarise(df, !!name := n()) and is solved by replacing the name <- enquo(name) line of the function with
name <- substitute(name)
The reason, as far as I understand it, is that a quosure is not only a name: it also carries the environment it came from. This makes sense when specifying column names in functions, because the function must know from which data frame (= environment in this case) the column comes in order to replace the name with its values.
However, name is supposed to take a new name specified by the user, so there is nothing to replace it with. I suspect that with name <- enquo(name), R wants to replace !!name with values instead of just putting in the new name, and therefore complains that there is no name on the LHS (because R replaced it with values?).
I am not sure, though, whether substitute is the idiomatic "programming with dplyr" way. Comments are welcome.
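For clarity, the full function with that single change would look like this (a sketch assembled from the replacement suggested above):
count_x_month <- function(df, var, name){
  var <- enquo(var)
  name <- substitute(name)
  df %>%
    filter(!is.na(!!var)) %>%
    transmute(month := floor_date(!!var, "month")) %>%
    group_by(month) %>%
    summarise(!!name := n())
}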
Create a dataframe showing customer IDs and first purchase dates:
dates <- seq.Date(as.Date("2017-01-01"), as.Date("2017-12-01"), by="day")
dates_rep <- c(dates,dates,dates)
cust_ids <- paste0('id_', floor(runif(length(dates_rep), min=0, max=100000)))
cust_frame <- data.frame(ID=cust_ids, FP_DATE=dates_rep)
head(cust_frame)
Use the plyr package to aggregate by FP_DATE:
library(plyr)
count(cust_frame, c('FP_DATE'))
Therefore, given a dataframe of customers and first purchase dates, we get a count of the number of customers purchasing for the first time on each date (aggregate on a month column derived with floor_date if you want the counts per month).
You can extend this to aggregate across any number of features in your dataset:
count(cust_frame, c('FP_DATE', 'feature_b', 'feature_c', 'feature_d', 'feature_e'))
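For the monthly counts the question is really after, the same cust_frame can also be run through the count_x_month() function from the accepted answer above (an illustration; new_customers is just an example output name):
library(dplyr)
library(lubridate)
cust_frame %>% count_x_month(FP_DATE, new_customers)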
