lag date upon condition, carry over - r

I have repeated measurements on individuals who, when solicited, either made a donation or not. I would like to carry the date of the last successful solicitation forward to the following observations until a new success occurs.
Here is my sample data:
set.seed(13)
df <- data.frame(ID = rep(letters[1:3], each = 4),
                 SolicitationDate = sample(seq(as.Date('2016/01/01'),
                                               as.Date('2018/01/01'), by = "day"), 3),
                 Success = rbinom(4, 1, 0.2))
df$ExpectedResult <- c(NA, NA, "2016-06-28", "2016-06-28",
                       NA, NA, "2016-10-11", "2016-10-11",
                       NA, NA, "2017-06-03", "2017-06-03")
Should an individual have multiple successes, the most recent success date should be carried over.
Thanks
Romain

Here's a version using the tidyverse. I think your expected output may be off, since the dates should be ordered within each ID, but I may be wrong about that; if so, let me know.
library(dplyr)

df %>%
  group_by(ID) %>%                                                # group by ID
  arrange(SolicitationDate, .by_group = TRUE) %>%                 # sort by date within each ID
  mutate(res = replace(SolicitationDate, Success == 0, NA)) %>%   # keep the date only on successes
  tidyr::fill(res)                                                # fill the last success date downward
This will give you
# A tibble: 12 x 4
# Groups: ID [3]
ID SolicitationDate Success res
<fct> <date> <int> <date>
1 a 2016-06-28 1 2016-06-28
2 a 2016-10-11 0 2016-06-28
3 a 2017-06-03 0 2016-06-28
4 a 2017-06-03 0 2016-06-28
5 b 2016-06-28 0 NA
6 b 2016-06-28 0 NA
7 b 2016-10-11 1 2016-10-11
8 b 2017-06-03 0 2016-10-11
9 c 2016-06-28 0 NA
10 c 2016-10-11 0 NA
11 c 2016-10-11 0 NA
12 c 2017-06-03 1 2017-06-03
I'm not sure whether you want the success dates themselves to be part of the result. If not, you could set those rows to missing and fill down again. In any case: hope this helps.
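For instance, if you read "not part of the result" as "a success row should only show the date of an earlier success", a rough sketch of that extra fill step (my reading of the suggestion, not checked against your expected output) could look like this:

library(dplyr)
library(tidyr)

df %>%
  group_by(ID) %>%
  arrange(SolicitationDate, .by_group = TRUE) %>%
  mutate(res = replace(SolicitationDate, Success == 0, NA)) %>%
  fill(res) %>%
  # blank the success rows themselves and fill down once more, so each row
  # carries only the date of an *earlier* success (or NA if there is none)
  mutate(res = replace(res, Success == 1, NA)) %>%
  fill(res) %>%
  ungroup()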

Related

dplyr grouping and using a conditional from multiple columns

I have a data frame like this
transactionId user_id total_in_pennies created_at X yearmonth
1 345068 8 9900 2018-09-13 New Customer 2018-09-01
2 346189 8 9900 2018-09-20 Repeat Customer 2018-09-01
3 363500 8 7700 2018-10-11 Repeat Customer 2018-10-01
4 376089 8 7700 2018-10-25 Repeat Customer 2018-10-01
5 198450 11 0 2018-01-18 New Customer 2018-01-01
6 203966 11 0 2018-01-25 Repeat Customer 2018-01-01
It has many more rows, but this small snippet should be enough to work with.
I am trying to group with dplyr so I can get a final data frame that has, for each yearmonth, the count of new and repeat customers and their respective sales totals.
I use this code
df_RFM11 <- data2 %>%
  group_by(yearmonth) %>%
  summarise(New_Customers = sum(X == "New Customer"),
            Repeat_Customers = sum(X == "Repeat Customer"),
            New_Customers_sales = sum(total_in_pennies & X == "New Customers"),
            Repeat_Customers_sales = sum(total_in_pennies & X == "Repeat Customers"))
and I get this result
> head(df_RFM11)
# A tibble: 6 x 5
yearmonth New_Customers Repeat_Customers New_Customers_sales Repeat_Customers_sales
<date> <int> <int> <int> <int>
1 2018-01-01 4880 2428 0 0
2 2018-02-01 2027 12068 0 0
3 2018-03-01 1902 15296 0 0
4 2018-04-01 1921 13363 0 0
5 2018-05-01 2631 18336 0 0
6 2018-06-01 2339 14492 0 0
I am able to get the first two columns I need (the counts of new and repeat customers), but I get 0s when I try to sum total_in_pennies for new customers and repeat customers.
Any help on what I am doing wrong?
You need to subset total_in_pennies with brackets (note also that your code compares against "New Customers"/"Repeat Customers" with a trailing "s", which never matches the data), like below:
df_RFM11 <- data2 %>%
  group_by(yearmonth) %>%
  summarise(New_Customers = sum(X == "New Customer"),
            Repeat_Customers = sum(X == "Repeat Customer"),
            New_Customers_sales = sum(total_in_pennies[X == "New Customer"]),
            Repeat_Customers_sales = sum(total_in_pennies[X == "Repeat Customer"])
            )
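For comparison, a sketch of an equivalent formulation that multiplies by the logical flag instead of subsetting (same result on this data, assuming no NAs in total_in_pennies; data2 is your full data frame):

library(dplyr)

df_RFM11 <- data2 %>%
  group_by(yearmonth) %>%
  summarise(New_Customers          = sum(X == "New Customer"),
            Repeat_Customers       = sum(X == "Repeat Customer"),
            # multiplying by the 0/1 flag zeroes out the other group's pennies
            New_Customers_sales    = sum(total_in_pennies * (X == "New Customer")),
            Repeat_Customers_sales = sum(total_in_pennies * (X == "Repeat Customer")))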

Break up rows representing long time intervals into multiple rows

I have a dataframe (tibble) with multiple rows; each row contains an IDNR, a start date, an end date, and an exposure status. The IDNR is a character variable, the start and end dates are Date variables, and the exposure status is numeric. This is what the top 3 rows look like:
# A tibble: 48,266 x 4
IDNR start end exposure
<chr> <date> <date> <dbl>
1 1 2018-02-15 2018-07-01 0
2 2 2017-10-30 2018-07-01 0
3 3 2016-02-11 2016-12-03 1
# ... with 48,256 more rows
In order to do a time-varying Cox regression, I want to split the rows up into 90-day parts while maintaining the overall start and end dates. Here is an example of what I would like to achieve: the new end date becomes start + 90 days, and a new row is created whose start date equals the end date of the previous row. If the remaining time between start and end is now less than 90 days, that is fine (as for IDNR 1 and 3); for IDNR 2, however, the remaining time still exceeds 90 days, so a third row needs to be added.
# A tibble: 48,266 x 4
# Groups: IDNR [33,240]
IDNR start end exposure
<chr> <date> <date> <dbl>
1 1 2018-02-15 2018-05-16 0
2 1 2018-05-16 2018-07-01 0
3 2 2017-10-30 2018-01-28 0
4 2 2018-01-28 2018-04-28 0
5 2 2018-04-28 2018-07-01 0
6 3 2016-02-11 2016-08-09 1
7 3 2016-08-09 2016-12-03 1
I'm relatively new to coding in R, but I've found dplyr to be very useful so far. So, if someone knows a solution using dplyr I would really appreciate that.
Thanks in advance!
Here you go:
Using df as your data frame:
df = data.frame(IDNR = 1:3,
                start = c("2018-02-15", "2017-10-30", "2016-02-11"),
                end = c("2018-07-01", "2018-07-01", "2016-12-03"),
                exposure = c(0, 0, 1))
Do:
library(lubridate)

newDF = apply(df, 1, function(x){
  newStart = seq(from = ymd(x["start"]), to = ymd(x["end"]), by = 90)
  newEnd   = c(seq(from = ymd(x["start"]), to = ymd(x["end"]), by = 90)[-1], ymd(x["end"]))
  d = data.frame(IDNR     = rep(x["IDNR"], length(newStart)),
                 start    = newStart,
                 end      = newEnd,
                 exposure = rep(x["exposure"], length(newStart)))
})
newDF = do.call(rbind, newDF)
newDF = newDF[newDF$start != newDF$end,]
Result:
> newDF
IDNR start end exposure
1 1 2018-02-15 2018-05-16 0
2 1 2018-05-16 2018-07-01 0
3 2 2017-10-30 2018-01-28 0
4 2 2018-01-28 2018-04-28 0
5 2 2018-04-28 2018-07-01 0
6 3 2016-02-11 2016-05-11 1
7 3 2016-05-11 2016-08-09 1
8 3 2016-08-09 2016-11-07 1
9 3 2016-11-07 2016-12-03 1
What this does is create a sequence of days from start to end in steps of 90 days, and build a small data frame from it together with the IDNR and exposure. The apply() returns a list of data frames that you can bind together with do.call(rbind, ...). The last line removes rows whose start and end dates are identical (which can happen when the interval length is an exact multiple of 90 days).
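Since the question asked for a dplyr solution, here is a rough tidyverse sketch of the same idea (my own alternative, not the answer's code; it assumes purrr and tidyr >= 1.0 for unnest(cols = )):

library(dplyr)
library(tidyr)
library(purrr)
library(lubridate)

newDF_tidy <- df %>%
  mutate(start = ymd(start), end = ymd(end)) %>%
  mutate(piece = map2(start, end, function(s, e) {
    # 90-day break points, plus the true end date
    b <- unique(c(seq(s, e, by = 90), e))
    tibble(start = head(b, -1), end = tail(b, -1))
  })) %>%
  select(IDNR, exposure, piece) %>%
  unnest(cols = piece) %>%
  select(IDNR, start, end, exposure)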

How can I use mutate to create a new column based only on a subset of other rows of a data frame?

I was agonizing over how to phrase my question. I have a data frame of accounts, and I want to create a new column that flags whether there is another account with the same email address within 30 days of that account.
I have a table like this.
AccountNumbers <- c(3748, 8894, 9923, 4502, 7283, 8012, 2938, 7485, 1010, 9877)
EmailAddress <- c("John#gmail.com", "John#gmail.com", "Alex#outlook.com", "Alan#yahoo.com",
                  "Stan#aol.com", "Mary#outlook.com", "Adam#outlook.com", "Tom#aol.com",
                  "Jane#yahoo.com", "John#gmail.com")
Dates <- c("2018-05-01", "2018-05-05", "2018-05-10", "2018-05-15", "2018-05-20",
           "2018-05-25", "2018-05-30", "2018-06-01", "2018-06-05", "2018-06-10")
df <- data.frame(AccountNumbers, EmailAddress, Dates)
print(df)
AccountNumbers EmailAddress Dates
3748 John#gmail.com 2018-05-01
8894 John#gmail.com 2018-05-05
9923 Alex#outlook.com 2018-05-10
4502 Alan#yahoo.com 2018-05-15
7283 Stan#aol.com 2018-05-20
8012 Mary#outlook.com 2018-05-25
2938 Adam#outlook.com 2018-05-30
7485 Tom#aol.com 2018-06-01
1010 Jane#yahoo.com 2018-06-05
9877 John#gmail.com 2018-06-10
John#gmail.com appears three times. I want to flag the first two rows because they appear within 30 days of each other, but I don't want to flag the third.
AccountNumbers EmailAddress Dates DuplicateEmailFlag
3748 John#gmail.com 2018-05-01 1
8894 John#gmail.com 2018-05-05 1
9923 Alex#outlook.com 2018-05-10 0
4502 Alan#yahoo.com 2018-05-15 0
7283 Stan#aol.com 2018-05-20 0
8012 Mary#outlook.com 2018-05-25 0
2938 Adam#outlook.com 2018-05-30 0
7485 Tom#aol.com 2018-06-01 0
1010 Jane#yahoo.com 2018-06-05 0
9877 John#gmail.com 2018-06-10 0
I've been trying to use an ifelse() inside of mutate, but I don't know if it's possible to tell dplyr to only consider rows that are within 30 days of the row being considered.
Edit: To clarify, I want to look at the 30 days around each account. So that if I had a scenario where the same email address was being added exactly every 30 days, all of the occurrences of that email should be flagged.
This seems to work. First, I define the data frame.
AccountNumbers <- c(3748, 8894, 9923, 4502, 7283, 8012, 2938, 7485, 1010, 9877)
EmailAddress <- c("John#gmail.com", "John#gmail.com", "Alex#outlook.com", "Alan#yahoo.com",
                  "Stan#aol.com", "Mary#outlook.com", "Adam#outlook.com", "Tom#aol.com",
                  "Jane#yahoo.com", "John#gmail.com")
Dates <- c("2018-05-01", "2018-05-05", "2018-05-10", "2018-05-15", "2018-05-20",
           "2018-05-25", "2018-05-30", "2018-06-01", "2018-06-05", "2018-06-10")
df <- data.frame(number = AccountNumbers, email = EmailAddress, date = as.Date(Dates))
Next, I group by email and check if there's an entry in the preceding or following 30 days. I also replace NAs (corresponding to cases with only one entry) with 0. Finally, I ungroup.
library(dplyr)
library(tidyr)

df %>%
  group_by(email) %>%
  mutate(dupe = coalesce(date - lag(date) < 30, (date - lead(date) < 30))) %>%
  mutate(dupe = replace_na(dupe, 0)) %>%
  ungroup
This gives,
# # A tibble: 10 x 4
# number email date dupe
# <dbl> <fct> <date> <dbl>
# 1 3748 John#gmail.com 2018-05-01 1
# 2 8894 John#gmail.com 2018-05-05 1
# 3 9923 Alex#outlook.com 2018-05-10 0
# 4 4502 Alan#yahoo.com 2018-05-15 0
# 5 7283 Stan#aol.com 2018-05-20 0
# 6 8012 Mary#outlook.com 2018-05-25 0
# 7 2938 Adam#outlook.com 2018-05-30 0
# 8 7485 Tom#aol.com 2018-06-01 0
# 9 1010 Jane#yahoo.com 2018-06-05 0
# 10 9877 John#gmail.com 2018-06-10 0
as required.
Edit: This makes the implicit assumption that your data are sorted by date. If not, you'd need to add an extra step to do so.
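For example, a minimal sketch of that extra sorting step (the same pipeline with an arrange() prepended):

library(dplyr)
library(tidyr)

df %>%
  arrange(date) %>%   # make sure rows are in date order before the lag/lead checks
  group_by(email) %>%
  mutate(dupe = coalesce(date - lag(date) < 30, date - lead(date) < 30)) %>%
  mutate(dupe = replace_na(dupe, 0)) %>%
  ungroup()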
I think this gets at what you want:
library(dplyr)

df %>%
  group_by(EmailAddress) %>%
  mutate(helper = cumsum(coalesce(if_else(difftime(Dates, lag(Dates), units = 'days') <= 30, 0, 1), 0))) %>%
  group_by(EmailAddress, helper) %>%
  mutate(DuplicateEmailFlag = (n() >= 2) * 1) %>%
  ungroup() %>%
  select(-helper)
# A tibble: 10 x 4
AccountNumbers EmailAddress Dates DuplicateEmailFlag
<dbl> <chr> <date> <dbl>
1 3748 John#gmail.com 2018-05-01 1
2 8894 John#gmail.com 2018-05-05 1
3 9923 Alex#outlook.com 2018-05-10 0
4 4502 Alan#yahoo.com 2018-05-15 0
5 7283 Stan#aol.com 2018-05-20 0
6 8012 Mary#outlook.com 2018-05-25 0
7 2938 Adam#outlook.com 2018-05-30 0
8 7485 Tom#aol.com 2018-06-01 0
9 1010 Jane#yahoo.com 2018-06-05 0
10 9877 John#gmail.com 2018-06-10 0
Note:
I think #Lyngbakr's solution is better for the circumstances in your question. Mine would be more appropriate if the size of the duplicate group might change (e.g., you want to check for 3 or 4 entries within 30 days of each other, rather than 2).
Slightly modified data:
AccountNumbers <- c(3748, 8894, 9923, 4502, 7283, 8012, 2938, 7485, 1010, 9877)
EmailAddress <- c("John#gmail.com", "John#gmail.com", "Alex#outlook.com", "Alan#yahoo.com",
                  "Stan#aol.com", "Mary#outlook.com", "Adam#outlook.com", "Tom#aol.com",
                  "Jane#yahoo.com", "John#gmail.com")
Dates <- as.Date(c("2018-05-01", "2018-05-05", "2018-05-10", "2018-05-15", "2018-05-20",
                   "2018-05-25", "2018-05-30", "2018-06-01", "2018-06-05", "2018-06-10"))
df <- data.frame(AccountNumbers, EmailAddress, Dates, stringsAsFactors = FALSE)

Count number of rows for each row that meet a logical condition

So I have some data with a time stamp, and for each row, I want to count the number of rows that fall within a certain time window. For example, if I have the data below with a time stamp in h:mm (column ts), I want to count the number of rows that occur from that time stamp to five minutes in the past (column count). The first n rows that are less than five minutes from the first data point should be NAs.
ts data count
1:01 123 NA
1:02 123 NA
1:03 123 NA
1:04 123 NA
1:06 123 5
1:07 123 5
1:10 123 3
1:11 123 4
1:12 123 4
This is straightforward to do with a for loop, but I've been trying to implement with the apply() family and have not yet found any success. Any suggestions?
EDIT: modified to account for the possibility of multiple readings per minute, raised in a comment.
Data with a new mid-minute reading:
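(A plausible reconstruction of that data, as an assumption on my part rather than the answer's original definition: the question's readings in h:mm:ss text form, so ymd_hms() below can parse them, plus one extra observation at 1:06:30.)

df <- data.frame(
  ts   = c("1:01:00", "1:02:00", "1:03:00", "1:04:00", "1:06:00", "1:06:30",
           "1:07:00", "1:10:00", "1:11:00", "1:12:00"),
  data = 123,
  stringsAsFactors = FALSE
)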
library(dplyr)
library(tidyr)

df %>%
  # Convert the text above to datetimes
  mutate(ts = lubridate::ymd_hms(paste(Sys.Date(), ts))) %>%
  # Count how many observations per minute
  group_by(ts_min = lubridate::floor_date(ts, "1 minute")) %>%
  summarize(obs_per_min = sum(!is.na(data))) %>%
  # Add rows for any missing minutes, counted as zero observations
  padr::pad(interval = "1 min") %>%
  replace_na(list(obs_per_min = 0)) %>%
  # Count cumulative observations, and calculate how many fall in the window
  # that begins 5 minutes ago and ends at the end of the current minute
  mutate(cuml_count = cumsum(obs_per_min),
         prior_cuml = lag(cuml_count) %>% replace_na(0),
         in_window  = cuml_count - lag(prior_cuml, 5)) %>%
  # Exclude unneeded columns and rows
  select(-cuml_count, -prior_cuml) %>%
  filter(obs_per_min > 0)
Output (now reflects add'l reading at 1:06:30)
# A tibble: 12 x 3
ts_min obs_per_min in_window
<dttm> <dbl> <dbl>
1 2018-09-26 01:01:00 1 NA
2 2018-09-26 01:02:00 1 NA
3 2018-09-26 01:03:00 1 NA
4 2018-09-26 01:04:00 1 NA
5 2018-09-26 01:06:00 2 6
6 2018-09-26 01:07:00 1 6
7 2018-09-26 01:10:00 1 4
8 2018-09-26 01:11:00 1 5
9 2018-09-26 01:12:00 1 4
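Since the question asked about the apply() family, here is a rough base-R sketch of the same trailing-window count (my own sketch, not part of the original answer; it assumes the reconstructed df above and counts rows in the five-minute window ending at each time stamp, inclusive of both endpoints, with NA for rows less than five minutes after the first observation):

ts_parsed <- lubridate::ymd_hms(paste(Sys.Date(), df$ts))
five_min  <- as.difftime(5, units = "mins")

df$count <- sapply(ts_parsed, function(t) {
  # NA while we are still within five minutes of the first observation
  if (t - min(ts_parsed) < five_min) return(NA_integer_)
  # rows falling in the five-minute window ending at t
  sum(ts_parsed >= t - five_min & ts_parsed <= t)
})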

Mark each row in a large dataframe via two variables

I have a dataframe like this (the real one is much larger):
time <- c(as.POSIXct('2011-11-11 06:00:00'), as.POSIXct('2011-11-11 06:05:00'),
          as.POSIXct('2011-11-11 07:05:00'), as.POSIXct('2011-11-11 07:10:00'),
          as.POSIXct('2011-11-11 07:13:00'), as.POSIXct('2011-11-11 07:33:00'),
          as.POSIXct('2011-11-11 05:05:00'), as.POSIXct('2011-11-11 06:05:00'),
          as.POSIXct('2011-11-11 06:20:00'), as.POSIXct('2011-11-11 09:05:00'))
plate <- c('a', 'a', 'a', 'b', 'c', 'd', 'e', 'e', 'e', 'e')
df <- data.frame(time, plate)
The time variable is the time at which the vehicle was identified by the video device, and the plate variable is the vehicle's plate. The dataframe is already ordered by plate first and time second.
Given this, I want to divide each vehicle's records into trips by marking the rows. Different vehicles (plates) always represent different trips. For a single vehicle, consecutive identification times within one trip should be less than 30 minutes apart; if not, the rows belong to different trips.
Currently I do this with the following code:
trip <- vector()
trip[1] <- 1
time_diff <- as.POSIXct('2011-11-11 07:00:00') - as.POSIXct('2011-11-11 06:30:00')

for (x in 2:nrow(df)) {
  if (!df$plate[x] == df$plate[x-1]) {
    trip[x] <- trip[x-1] + 1
  } else {
    if (df$time[x] - df$time[x-1] < time_diff) {
      trip[x] <- trip[x-1]
    } else {
      trip[x] <- trip[x-1] + 1
    }
  }
}
df <- cbind(df, trip)
However, my df contains more than seven million rows, so this method is very slow. Are there more efficient ways to do this?
I'll suggest using dplyr for this, though with 7M rows you might consider a data.table solution if this doesn't work well for you.
library(dplyr)
time_diff <- as.POSIXct('2011-11-11 07:00:00') - as.POSIXct('2011-11-11 06:30:00')

df %>%
  arrange(time) %>%   # it's important, so I reinforce it here
  group_by(plate) %>%
  mutate(
    trip = cumsum( c(TRUE, diff(time) > time_diff) )
  ) %>%
  ungroup()
# # A tibble: 10 × 3
# time plate trip
# <dttm> <fctr> <int>
# 1 2011-11-11 06:00:00 a 1
# 2 2011-11-11 06:05:00 a 1
# 3 2011-11-11 07:05:00 a 2
# 4 2011-11-11 07:10:00 b 1
# 5 2011-11-11 07:13:00 c 1
# 6 2011-11-11 07:33:00 d 1
# 7 2011-11-11 05:05:00 e 1
# 8 2011-11-11 06:05:00 e 2
# 9 2011-11-11 06:20:00 e 2
# 10 2011-11-11 09:05:00 e 3
I much prefer the above solution using group_by, but if you want the trip to be unique across plates, one technique is to handle the grouping yourself (requiring strict ordering):
df %>%
  arrange(plate, time) %>%
  mutate(
    trip = cumsum( plate != lag(plate, default = plate[1]) | c(TRUE, diff(time) > time_diff) )
  )
# time plate trip
# 1 2011-11-11 06:00:00 a 1
# 2 2011-11-11 06:05:00 a 1
# 3 2011-11-11 07:05:00 a 2
# 4 2011-11-11 07:10:00 b 3
# 5 2011-11-11 07:13:00 c 4
# 6 2011-11-11 07:33:00 d 5
# 7 2011-11-11 05:05:00 e 6
# 8 2011-11-11 06:05:00 e 7
# 9 2011-11-11 06:20:00 e 7
# 10 2011-11-11 09:05:00 e 8
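And since data.table is mentioned above for data of this size, here is a rough sketch of the same grouped-cumsum idea in data.table (my own sketch, not part of the original answer):

library(data.table)

time_diff <- as.difftime(30, units = "mins")  # same 30-minute threshold as above

dt <- as.data.table(df)
setorder(dt, plate, time)
dt[, trip := cumsum(c(TRUE, diff(time) > time_diff)), by = plate]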
