Day Time Numbers
6388 2017-02-01 10:43 R33
7129 2017-02-04 15:32 N39.0, N39.0, N39.0
9689 2017-02-17 08:54 S72.11, S72.11, S72.11, S72.11
6703 2017-02-02 18:55 R11
9026 2017-02-13 17:34 S06.0, S06.0, S06.0
5013 2017-01-25 00:33 J18.1, J18.1, J18.1, J18.1
5849 2017-01-29 17:57 I21.4, I21.4, I21.4
9245 2017-02-14 19:03 J18.0, J18.0, J18.0
1978 2017-01-09 21:23 K59.0
5021 2017-01-25 02:46 I47.1, I47.1, I47.1
9258 2017-02-14 20:19 S42.3
541 2017-01-03 11:44 I63.8, I63.8, I63.8
4207 2017-01-20 19:52 E83.58, E83.58, E83.58
8650 2017-02-11 18:39 R55, R55, S06.0, S06.0, R55
9442 2017-02-15 21:30 K86.1
4186 2017-01-20 18:27 S05.1
4231 2017-01-20 22:10 M17.9
6847 2017-02-03 11:45 L02.4
1739 2017-01-08 21:19 S20.2
3685 2017-01-18 09:56 G40.9
9497 2017-02-16 09:52 S83.6
2563 2017-01-12 20:47 M13.16, M25.56, M25.56
9731 2017-02-17 13:10 B99, B99, N39.0, N39.0
7759 2017-02-07 14:25 R51, G43.0, G43.0
368 2017-01-02 15:05 T83.0, T83.0, T83.0, N13.3, N13.6
I want to aggregate this df in a special way. I want to count how many Numbers start with e.g. "A" on each day. I want a new dataframe that looks like this:
Day GroupA GroupB GroupC .....
1 2017-01-01 2 2 0
2 2017-01-02 ..................
GroupA means Numbers starting with "A". If there are multiple numbers starting with "A" in a single row, it should still only be counted once. The class of my number column is character.
> class(df[1,3])
[1] "character"
> df[1,3]
[1] "A41.8, A41.51, A41.51"**
My problem is how to combine the aggregate command with the counts. My real df is a lot bigger (it spans more than 2 years), so I need an automated solution.
EDIT: See the data below.
structure(list(Day= c("2017-01-07", "2017-01-23", "2017-01-08",
"2017-01-13", "2017-02-10", "2017-01-07", "2017-01-24", "2017-01-02",
"2017-01-03", "2017-01-06", "2017-01-11", "2017-01-21", "2017-01-13",
"2017-01-10", "2017-02-18", "2017-01-10", "2017-01-31", "2017-01-27",
"2017-01-23", "2017-01-13", "2017-02-10", "2017-01-09", "2017-01-23",
"2017-01-09", "2017-01-08"), Time= c("02:02", "14:51", "02:12",
"17:49", "00:00", "21:30", "22:28", "17:27", "12:14", "22:52",
"14:19", "11:40", "19:33", "04:01", "15:59", "14:57", "08:34",
"13:21", "02:01", "14:29", "20:17", "14:30", "02:34", "04:56",
"14:34"), Number= c("H10.9", "K85.80, K85.20, K85.80, K85.20",
"R09.1", "I10.90", "I48.9, I48.0, I48.9, I48.0", "A09.0, A09.0, R42, R42",
"H16.1", "K92.1, K92.1, K92.1", "K40.90, J12.2, J18.0, J96.01, J12.2",
"B99, J15.8, J18.0, J15.8", "S01.55", "M21.33", "I10.01, I10.01, J44.81, J44.81",
"S00.95", "B08.2", "S05.1", "M20.1", "G40.2, S93.40, S93.40",
"M25.51", "J44.19, J44.11, J44.19, J44.11", "G40.9, G40.2, G40.2",
"E87.1, E87.1, J18.0, J18.0", "I10.91", "R22.0", "S06.5, S06.5, S06.5, R06.88, S12.22"
)), .Names = c("Day", "Time", "Number"), row.names = c(1336L,
4687L, 1536L, 2737L, 8272L, 1507L, 4994L, 400L, 550L, 1305L,
2325L, 4292L, 2748L, 2008L, 9974L, 2113L, 6144L, 5433L, 4577L,
2697L, 8468L, 1883L, 4578L, 1783L, 1657L), class = "data.frame")
This is a pretty interesting problem that takes a little digging into. The first thing to do is get the unique capital letters in Number for each row. stringr::str_extract_all gets you a list-column of character vectors matching the regex, and after taking unique values from each list entry, you have this:
library(dplyr)
library(tidyr)

# df1 is the data frame from the dput() structure above
as_tibble(df1) %>%
  mutate(Day = lubridate::ymd(Day),
         letters = purrr::map(stringr::str_extract_all(Number, "[A-Z]"), unique)) %>%
  select(-Number) %>%
  head()
#> # A tibble: 6 x 3
#> Day Time letters
#> <date> <chr> <list>
#> 1 2017-01-07 02:02 <chr [1]>
#> 2 2017-01-23 14:51 <chr [1]>
#> 3 2017-01-08 02:12 <chr [1]>
#> 4 2017-01-13 17:49 <chr [1]>
#> 5 2017-02-10 00:00 <chr [1]>
#> 6 2017-01-07 21:30 <chr [2]>
Unnest it so you have one row per date & time per letter, then count the observations of each letter per day; the order of operations matters here. Finally, reshape it into a wide format so each group gets a column.
as_tibble(df1) %>%
  mutate(Day = lubridate::ymd(Day),
         letters = purrr::map(stringr::str_extract_all(Number, "[A-Z]"), unique)) %>%
  select(-Number) %>%
  unnest(letters) %>%
  count(Day, letters) %>%
  arrange(letters) %>%
  pivot_wider(names_from = letters, names_prefix = "group",
              values_from = n, values_fill = list(n = 0)) %>%
  head()
#> # A tibble: 6 x 12
#> Day groupA groupB groupE groupG groupH groupI groupJ groupK groupM
#> <date> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 2017-01-07 1 0 0 0 1 0 0 0 0
#> 2 2017-01-06 0 1 0 0 0 0 1 0 0
#> 3 2017-02-18 0 1 0 0 0 0 0 0 0
#> 4 2017-01-09 0 0 1 0 0 0 1 0 0
#> 5 2017-01-27 0 0 0 1 0 0 0 0 0
#> 6 2017-02-10 0 0 0 1 0 1 0 0 0
#> # … with 2 more variables: groupR <int>, groupS <int>
In the first few rows of this sample there aren't any 2s, but there are some later in the data frame. (I don't yet understand how pivot_wider orders its rows, but you can arrange by day after this if you want.)
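If row order matters to you, that arranging step is just one more line on the pipeline (a minimal sketch, assuming the pivoted result above is stored in a variable hypothetically named wide):
wide %>% arrange(Day)  # restore chronological order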
I have a database where animals in a herd are tested every 6 months (the number of animals can change over time). The issue is that the animals in a herd are not all tested on the same day, but within a window of about 2 months.
I would like to know how I can create a new column that merges all these close dates (grouping by herd), so I can calculate the number of times a herd has been tested.
(A plot in the original question showed an example of a herd tested 8 times at different dates, with each dot representing an animal.)
Here is an example of the data:
df <- data.frame(
  animal = c(rep(paste0("Animal", 1:6), 2), rep(paste0("Animal", 7:12), 2)),
  herd = rep(c("Herd1", "Herd2"), each = 12),
  date = c("2017-01-01", "2017-01-01", "2017-01-17", "2017-02-04", "2017-02-04",
           "2017-02-05", "2017-06-01", "2017-06-03", "2017-07-01", "2017-06-21",
           "2017-06-01", "2017-06-15", "2017-02-01", "2017-02-01", "2017-02-15",
           "2017-02-21", "2017-03-05", "2017-03-01", "2017-07-01", "2017-07-01",
           "2017-07-15", "2017-07-21", "2017-08-05", "2017-08-01"))
So the desired outcome will be:
animal herd date testing
1 Animal1 Herd1 2017-01-01 1
2 Animal2 Herd1 2017-01-01 1
3 Animal3 Herd1 2017-01-17 1
4 Animal4 Herd1 2017-02-04 1
5 Animal5 Herd1 2017-02-04 1
6 Animal6 Herd1 2017-02-05 1
7 Animal1 Herd1 2017-06-01 2
8 Animal2 Herd1 2017-06-03 2
9 Animal3 Herd1 2017-07-01 2
10 Animal4 Herd1 2017-06-21 2
11 Animal5 Herd1 2017-06-01 2
12 Animal6 Herd1 2017-06-15 2
13 Animal7 Herd2 2017-02-01 1
14 Animal8 Herd2 2017-02-01 1
15 Animal9 Herd2 2017-02-15 1
16 Animal10 Herd2 2017-02-21 1
17 Animal11 Herd2 2017-03-05 1
18 Animal12 Herd2 2017-03-01 1
19 Animal7 Herd2 2017-07-01 2
20 Animal8 Herd2 2017-07-01 2
21 Animal9 Herd2 2017-07-15 2
22 Animal10 Herd2 2017-07-21 2
23 Animal11 Herd2 2017-08-05 2
24 Animal12 Herd2 2017-08-01 2
I would like to apply something like this, but treating dates close to each other as the same testing:
df %>%
group_by(herd) %>%
mutate(testing = dense_rank(date))
Thanks!
You can group into 5-month windows with floor_date and apply dense_rank. Since the smallest gap between two testings of the same animal is about 5 months, the unit has to be 5 months.
library(dplyr)
library(lubridate)
df %>%
group_by(testing = dense_rank(floor_date(ymd(date), unit = "5 months")))
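If you'd rather keep testing as an ordinary column and preserve the group_by(herd) structure from the question's attempt, the same idea fits inside mutate (a sketch using the df defined above):
library(dplyr)
library(lubridate)

df %>%
  group_by(herd) %>%
  # floor each date to a 5-month boundary, then rank the distinct boundaries
  mutate(testing = dense_rank(floor_date(ymd(date), unit = "5 months"))) %>%
  ungroup()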
I have a dataframe that looks like this:
CYCLE date_cycle Randomization_Date COUPLEID
1 0 2016-02-16 10892
2 1 2016-08-17 2016-02-19 10894
3 1 2016-08-14 2016-02-26 10899
4 1 2016-02-26 10900
5 2 2016-03--- 2016-02-26 10900
6 3 2016-07-19 2016-02-26 10900
7 4 2016-11-15 2016-02-26 10900
8 1 2016-02-27 10901
9 2 2016-02--- 2016-02-27 10901
10 1 2016-03-27 2016-03-03 10902
11 2 2016-04-21 2016-03-03 10902
12 1 2016-03-03 10903
13 2 2016-03--- 2016-03-03 10903
14 0 2016-03-03 10904
15 1 2016-03-03 10905
16 2 2016-03-03 10905
17 3 2016-03-03 10905
18 4 2016-04-14 2016-03-03 10905
19 5 2016-05--- 2016-03-03 10905
20 6 2016-06--- 2016-03-03 10905
The goal is to fill in the missing or incomplete dates for a given ID by taking an earlier or later known date and adding or subtracting 28 days.
The date_cycle variable was originally in the dataframe as a character type.
I have tried to code it as follows:
mutate(rowwise(df),
       newdate = case_when(
         str_count(date1, pattern = "\\W") > 2 ~
           lag(as.Date.character(date1, "%Y-%m-%d"), 1) + days(28)))
But I need to apply it per ID and per CYCLE.
An example of my data could be made like this:
data.frame(stringsAsFactors = FALSE,
  CYCLE = c(0, 1, 1, 1, 2, 3, 4, 1, 2, 1, 2, 1, 2, 0, 1, 2, 3, 4, 5, 6),
  date_cycle = c(NA, "2016-08-17", "2016-08-14", NA, "2016-03---", "2016-07-19",
                 "2016-11-15", NA, "2016-02---", "2016-03-27", "2016-04-21", NA,
                 "2016-03---", NA, NA, NA, NA, "2016-04-14", "2016-05---", "2016-06---"),
  Randomization_Date = c("2016-02-16", "2016-02-19", "2016-02-26", "2016-02-26",
                         "2016-02-26", "2016-02-26", "2016-02-26", "2016-02-27",
                         "2016-02-27", "2016-03-03", "2016-03-03", "2016-03-03",
                         "2016-03-03", "2016-03-03", "2016-03-03", "2016-03-03",
                         "2016-03-03", "2016-03-03", "2016-03-03", "2016-03-03"),
  COUPLEID = c(10892, 10894, 10899, 10900, 10900, 10900, 10900, 10901, 10901,
               10902, 10902, 10903, 10903, 10904, 10905, 10905, 10905, 10905,
               10905, 10905)
)
The output I am after would look like:
COUPLEID CYCLE date_cycle new_date_cycle
a 1 2014-03-27 2014-03-27
a 1 2014-04--- 2014-04-24
b 1 2014-03-24 2014-03-24
b 2 2014-04--- 2014-04-21
b 3 2014-05--- 2014-05-19
c 1 2014-04--- 2014-04-02
c 2 2014-04-30 2014-04-30
I have also started to write a long conditional, but I wanted to ask here and see if anyone knew of a more straightforward way to do it, instead of explicitly writing out all of the possible conditions.
mutate(rowwise(df),
       newdate = case_when(
         grp == 1 & str_count(date1, pattern = "\\W") > 2 & !is.na(lead(date1, 1)) ~ lead(date1, 1) - days(28),
         grp == 2 & str_count(date1, pattern = "\\W") > 2 & !is.na(lead(date1, 1)) ~ lead(date1, 1) - days(28),
         grp == 3 & str_count(date1, pattern = "\\W") > 2 & ...))
A function to fill dates forwards and backwards:
filldates <- function(dates) {
  m <- which(!is.na(dates))
  if (length(m) > 0 && length(m) != length(dates)) {
    # fill backwards from the first known date, 28 days at a time
    if (m[1] > 1) for (i in seq(m[1], 1, -1)) if (is.na(dates[i])) dates[i] <- dates[i + 1] - 28
    # then fill any remaining NAs forwards
    if (sum(is.na(dates)) > 0) for (i in seq_along(dates)) if (is.na(dates[i])) dates[i] <- dates[i - 1] + 28
  }
  return(dates)
}
Usage:
# data, ID, grp and date1 here correspond to the question's df, COUPLEID, CYCLE and date_cycle
data %>%
  arrange(ID, grp) %>%
  group_by(ID) %>%
  mutate(date2 = filldates(as.Date(date1, "%Y-%m-%d")))
Output:
ID grp date1 date2
<chr> <dbl> <chr> <date>
1 a 1 2014-03-27 2014-03-27
2 a 2 2014-04--- 2014-04-24
3 b 1 2014-03-24 2014-03-24
4 b 2 2014-04--- 2014-04-21
5 b 3 2014-05--- 2014-05-19
6 c 1 2014-03--- 2014-04-02
7 c 2 2014-04-30 2014-04-30
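Applied to the question's own column names, the same call would look like this (a sketch, assuming filldates() from above and the df built by the question's data.frame(...) example):
library(dplyr)

df %>%
  arrange(COUPLEID, CYCLE) %>%
  group_by(COUPLEID) %>%
  # partial dates like "2016-03---" parse to NA and then get filled in 28-day steps
  mutate(new_date_cycle = filldates(as.Date(date_cycle, "%Y-%m-%d")))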
An option using purrr::accumulate(): take each ID's one parseable date as a reference, then accumulate 28-day steps forwards when that known date is the group's earliest, or backwards when it is the latest.
library(tidyverse)
library(lubridate)  # ymd(), days(), as_date()

center <- df %>%
  group_by(ID) %>%
  mutate(helpDate = ymd(str_replace(date1, '---', '-01')),
         refDate = max(ymd(date1), na.rm = TRUE))

backward <- center %>%
  filter(refDate == max(helpDate)) %>%
  mutate(date2 = accumulate(refDate, ~ . - days(28), .dir = 'backward'))

forward <- center %>%
  filter(refDate == min(helpDate)) %>%
  mutate(date2 = accumulate(refDate, ~ . + days(28)))

bind_rows(forward, backward) %>%
  ungroup() %>%
  mutate(date2 = as_date(date2)) %>%
  select(-c('helpDate', 'refDate'))
# # A tibble: 7 x 4
# ID grp date1 date2
# <chr> <int> <chr> <date>
# 1 a 1 2014-03-27 2014-03-27
# 2 a 2 2014-04--- 2014-04-24
# 3 b 1 2014-03-24 2014-03-24
# 4 b 2 2014-04--- 2014-04-21
# 5 b 3 2014-05--- 2014-05-19
# 6 c 1 2014-03--- 2014-04-02
# 7 c 2 2014-04-30 2014-04-30
Problem: Calculate the log difference within each day (grouping by day). The ideal result should produce NA for the first observation of each day.
library(dplyr)
library(tidyverse)
library(tibble)
library(lubridate)
# data
df <- tibble(
  t = c("2019-10-01 09:30", "2019-10-01 09:35", "2019-10-01 09:40",
        "2019-10-02 09:30", "2019-10-02 09:35", "2019-10-02 09:40",
        "2019-10-03 09:30", "2019-10-03 09:35", "2019-10-03 09:40"),
  v = c(105.0061, 104.891, 104.8321, 104.5552, 104.4407, 104.5837,
        104.5534, 103.6992, 103.5851)
)
# my attempt
df %>%
  # create day
  mutate(day = day(t)) %>%
  # group by day
  group_by(day) %>%
  # calculate log difference and append column
  mutate(logdif = diff(log(df$v)))
The problem is
Error: Column `logdif` must be length 3 (the group size) or one, not 8
What I need:
[1] NA -0.0010967280 -0.0005616930 NA -0.0010957154
[6] 0.0013682615 NA -0.0082035450 -0.0011009036
Never use $ inside dplyr pipes: df$v refers to the whole original column and ignores the grouping. You also need to prepend an NA to the diff() output so its length matches each group.
library(dplyr)
df %>%
  mutate(day = lubridate::day(t)) %>%
  group_by(day) %>%
  mutate(logdif = c(NA, diff(log(v))))
# t v day logdif
# <chr> <dbl> <int> <dbl>
#1 2019-10-01 09:30 105. 1 NA
#2 2019-10-01 09:35 105. 1 -0.00110
#3 2019-10-01 09:40 105. 1 -0.000562
#4 2019-10-02 09:30 105. 2 NA
#5 2019-10-02 09:35 104. 2 -0.00110
#6 2019-10-02 09:40 105. 2 0.00137
#7 2019-10-03 09:30 105. 3 NA
#8 2019-10-03 09:35 104. 3 -0.00820
#9 2019-10-03 09:40 104. 3 -0.00110
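Equivalently, since a log difference is just a difference of logs, lag() sidesteps the length problem entirely because it pads with NA on its own (a sketch):
library(dplyr)
library(lubridate)

df %>%
  mutate(day = day(t)) %>%
  group_by(day) %>%
  mutate(logdif = log(v) - lag(log(v))) %>%  # first row of each group is NA automatically
  ungroup()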
I have a data frame like this
transactionId user_id total_in_pennies created_at X yearmonth
1 345068 8 9900 2018-09-13 New Customer 2018-09-01
2 346189 8 9900 2018-09-20 Repeat Customer 2018-09-01
3 363500 8 7700 2018-10-11 Repeat Customer 2018-10-01
4 376089 8 7700 2018-10-25 Repeat Customer 2018-10-01
5 198450 11 0 2018-01-18 New Customer 2018-01-01
6 203966 11 0 2018-01-25 Repeat Customer 2018-01-01
it has many more rows, but that little snippet can be used.
I am trying to group using dplyr so I can get a final data frame with monthly counts and sales totals for new and repeat customers (the desired output was shown as an image in the original question, not reproduced here).
I use this code
df_RFM11 <- data2 %>%
  group_by(yearmonth) %>%
  summarise(New_Customers = sum(X == "New Customer"),
            Repeat_Customers = sum(X == "Repeat Customer"),
            New_Customers_sales = sum(total_in_pennies & X == "New Customers"),
            Repeat_Customers_sales = sum(total_in_pennies & X == "Repeat Customers"))
and I get this result
> head(df_RFM11)
# A tibble: 6 x 5
yearmonth New_Customers Repeat_Customers New_Customers_sales Repeat_Customers_sales
<date> <int> <int> <int> <int>
1 2018-01-01 4880 2428 0 0
2 2018-02-01 2027 12068 0 0
3 2018-03-01 1902 15296 0 0
4 2018-04-01 1921 13363 0 0
5 2018-05-01 2631 18336 0 0
6 2018-06-01 2339 14492 0 0
and I am able to get the first two columns I need, the counts of new and repeat customers, but I get 0s when I try to get the sum of total_in_pennies for new and repeat customers.
Any help on what I am doing wrong?
You'd need to subset with brackets, like below. As written, total_in_pennies & X == "New Customers" is evaluated as a logical AND, so you're summing TRUEs and FALSEs rather than the sales amounts; on top of that, the label "New Customers" doesn't match "New Customer", so the condition is never TRUE and the sum is always 0:
df_RFM11 <- data2 %>%
group_by(yearmonth) %>%
summarise(New_Customers=sum(X=="New Customer"),
Repeat_Customers=sum(X=="Repeat Customer"),
New_Customers_sales=sum(total_in_pennies[X=="New Customer"]),
Repeat_Customers_sales=sum(total_in_pennies[X=="Repeat Customer"])
)
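An equivalent formulation multiplies the amounts by the logical condition instead of subsetting (a sketch, assuming the same data2 as in the question):
library(dplyr)

data2 %>%
  group_by(yearmonth) %>%
  # TRUE/FALSE becomes 1/0, so non-matching rows contribute nothing to the sum
  summarise(New_Customers_sales = sum(total_in_pennies * (X == "New Customer")))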
I was agonizing over how to phrase my question. I have a data frame of accounts and I want to create a new column that is a flag for whether there is another account that has a duplicate email within 30 days of that account.
I have a table like this.
AccountNumbers <- c(3748,8894,9923,4502,7283,8012,2938,7485,1010,9877)
EmailAddress <- c("John@gmail.com","John@gmail.com","Alex@outlook.com","Alan@yahoo.com","Stan@aol.com","Mary@outlook.com","Adam@outlook.com","Tom@aol.com","Jane@yahoo.com","John@gmail.com")
Dates <- c("2018-05-01","2018-05-05","2018-05-10","2018-05-15","2018-05-20",
"2018-05-25","2018-05-30","2018-06-01","2018-06-05","2018-06-10")
df <- data.frame(AccountNumbers,EmailAddress,Dates)
print(df)
AccountNumbers EmailAddress Dates
3748 John@gmail.com 2018-05-01
8894 John@gmail.com 2018-05-05
9923 Alex@outlook.com 2018-05-10
4502 Alan@yahoo.com 2018-05-15
7283 Stan@aol.com 2018-05-20
8012 Mary@outlook.com 2018-05-25
2938 Adam@outlook.com 2018-05-30
7485 Tom@aol.com 2018-06-01
1010 Jane@yahoo.com 2018-06-05
9877 John@gmail.com 2018-06-10
John@gmail.com appears three times. I want to flag the first two rows because they fall within 30 days of each other, but I don't want to flag the third.
AccountNumbers EmailAddress Dates DuplicateEmailFlag
3748 John@gmail.com 2018-05-01 1
8894 John@gmail.com 2018-05-05 1
9923 Alex@outlook.com 2018-05-10 0
4502 Alan@yahoo.com 2018-05-15 0
7283 Stan@aol.com 2018-05-20 0
8012 Mary@outlook.com 2018-05-25 0
2938 Adam@outlook.com 2018-05-30 0
7485 Tom@aol.com 2018-06-01 0
1010 Jane@yahoo.com 2018-06-05 0
9877 John@gmail.com 2018-06-10 0
I've been trying to use an ifelse() inside of mutate, but I don't know if it's possible to tell dplyr to only consider rows that are within 30 days of the row being considered.
Edit: To clarify, I want to look at the 30 days around each account, so that if the same email address were added exactly every 30 days, all occurrences of that email would be flagged.
This seems to work. First, I define the data frame.
AccountNumbers <- c(3748,8894,9923,4502,7283,8012,2938,7485,1010,9877)
EmailAddress <- c("John@gmail.com","John@gmail.com","Alex@outlook.com","Alan@yahoo.com","Stan@aol.com","Mary@outlook.com","Adam@outlook.com","Tom@aol.com","Jane@yahoo.com","John@gmail.com")
Dates <- c("2018-05-01","2018-05-05","2018-05-10","2018-05-15","2018-05-20",
"2018-05-25","2018-05-30","2018-06-01","2018-06-05","2018-06-10")
df <- data.frame(number = AccountNumbers, email = EmailAddress, date = as.Date(Dates))
Next, I group by email and check if there's an entry in the preceding or following 30 days. I also replace NAs (corresponding to cases with only one entry) with 0. Finally, I ungroup.
library(dplyr)
library(tidyr)  # for replace_na()

df %>%
  group_by(email) %>%
  mutate(dupe = coalesce(date - lag(date) < 30, date - lead(date) < 30)) %>%
  mutate(dupe = replace_na(dupe, 0)) %>%
  ungroup()
This gives,
# # A tibble: 10 x 4
# number email date dupe
# <dbl> <fct> <date> <dbl>
# 1 3748 John@gmail.com 2018-05-01 1
# 2 8894 John@gmail.com 2018-05-05 1
# 3 9923 Alex@outlook.com 2018-05-10 0
# 4 4502 Alan@yahoo.com 2018-05-15 0
# 5 7283 Stan@aol.com 2018-05-20 0
# 6 8012 Mary@outlook.com 2018-05-25 0
# 7 2938 Adam@outlook.com 2018-05-30 0
# 8 7485 Tom@aol.com 2018-06-01 0
# 9 1010 Jane@yahoo.com 2018-06-05 0
# 10 9877 John@gmail.com 2018-06-10 0
as required.
Edit: This makes the implicit assumption that your data are sorted by date. If not, you'd need to add an extra step to do so.
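For example (a one-line sketch):
df <- df %>% arrange(email, date)  # ensure dates are ordered within each email before flagging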
I think this gets at what you want:
df %>%
group_by(EmailAddress) %>%
mutate(helper = cumsum(coalesce(if_else(difftime(Dates, lag(Dates), units = 'days') <= 30, 0, 1), 0))) %>%
group_by(EmailAddress, helper) %>%
mutate(DuplicateEmailFlag = (n() >= 2)*1) %>%
ungroup() %>%
select(-helper)
# A tibble: 10 x 4
AccountNumbers EmailAddress Dates DuplicateEmailFlag
<dbl> <chr> <date> <dbl>
1 3748 John@gmail.com 2018-05-01 1
2 8894 John@gmail.com 2018-05-05 1
3 9923 Alex@outlook.com 2018-05-10 0
4 4502 Alan@yahoo.com 2018-05-15 0
5 7283 Stan@aol.com 2018-05-20 0
6 8012 Mary@outlook.com 2018-05-25 0
7 2938 Adam@outlook.com 2018-05-30 0
8 7485 Tom@aol.com 2018-06-01 0
9 1010 Jane@yahoo.com 2018-06-05 0
10 9877 John@gmail.com 2018-06-10 0
Note:
I think @Lyngbakr's solution is better for the circumstances in your question. Mine would be more appropriate if the size of the duplicate group might change (e.g., you want to check for 3 or 4 entries within 30 days of each other, rather than 2); see the sketch after the data below.
Slightly modified data (Dates converted to Date class, strings kept as characters):
AccountNumbers <- c(3748,8894,9923,4502,7283,8012,2938,7485,1010,9877)
EmailAddress <- c("John@gmail.com","John@gmail.com","Alex@outlook.com","Alan@yahoo.com","Stan@aol.com","Mary@outlook.com","Adam@outlook.com","Tom@aol.com","Jane@yahoo.com","John@gmail.com")
Dates <- as.Date(c("2018-05-01","2018-05-05","2018-05-10","2018-05-15","2018-05-20",
"2018-05-25","2018-05-30","2018-06-01","2018-06-05","2018-06-10"))
df <- data.frame(AccountNumbers,EmailAddress,Dates, stringsAsFactors = FALSE)
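Following up on the note above: requiring, say, at least three entries in a 30-day chain only changes the group-size threshold (a sketch using this modified data):
library(dplyr)

df %>%
  group_by(EmailAddress) %>%
  mutate(helper = cumsum(coalesce(if_else(difftime(Dates, lag(Dates), units = "days") <= 30, 0, 1), 0))) %>%
  group_by(EmailAddress, helper) %>%
  mutate(DuplicateEmailFlag = (n() >= 3) * 1) %>%  # flag only chains of three or more
  ungroup() %>%
  select(-helper)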