I have a data frame like this
transactionId user_id total_in_pennies created_at X yearmonth
1 345068 8 9900 2018-09-13 New Customer 2018-09-01
2 346189 8 9900 2018-09-20 Repeat Customer 2018-09-01
3 363500 8 7700 2018-10-11 Repeat Customer 2018-10-01
4 376089 8 7700 2018-10-25 Repeat Customer 2018-10-01
5 198450 11 0 2018-01-18 New Customer 2018-01-01
6 203966 11 0 2018-01-25 Repeat Customer 2018-01-01
It has many more rows, but this snippet should be enough to work with.
I am trying to group with dplyr so that I end up with one row per yearmonth holding the count of new and repeat customers plus the sales totals for each group.
I use this code
df_RFM11 <- data2 %>%
  group_by(yearmonth) %>%
  summarise(New_Customers = sum(X == "New Customer"),
            Repeat_Customers = sum(X == "Repeat Customer"),
            New_Customers_sales = sum(total_in_pennies & X == "New Customers"),
            Repeat_Customers_sales = sum(total_in_pennies & X == "Repeat Customers"))
and I get this result
> head(df_RFM11)
# A tibble: 6 x 5
yearmonth New_Customers Repeat_Customers New_Customers_sales Repeat_Customers_sales
<date> <int> <int> <int> <int>
1 2018-01-01 4880 2428 0 0
2 2018-02-01 2027 12068 0 0
3 2018-03-01 1902 15296 0 0
4 2018-04-01 1921 13363 0 0
5 2018-05-01 2631 18336 0 0
6 2018-06-01 2339 14492 0 0
I am able to get the first 2 columns I need (the counts of new and repeat customers), but I get 0s when I try to get the sum of total_in_pennies for new and repeat customers.
Any help on what I am doing wrong?
You'd need to subset with brackets, as below. In your version, total_in_pennies & X == "New Customers" ANDs the pennies (coerced to logical) with the comparison, so you are summing TRUEs and FALSEs rather than pennies, and because the category is actually "New Customer" (no trailing "s") the comparison is always FALSE, which is why you get 0.
df_RFM11 <- data2 %>%
group_by(yearmonth) %>%
summarise(New_Customers=sum(X=="New Customer"),
Repeat_Customers=sum(X=="Repeat Customer"),
New_Customers_sales=sum(total_in_pennies[X=="New Customer"]),
Repeat_Customers_sales=sum(total_in_pennies[X=="Repeat Customer"])
)
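The same result can also be written by multiplying by the logical mask instead of subsetting; a quick sketch of the equivalent call (untested against your data):

df_RFM11 <- data2 %>%
  group_by(yearmonth) %>%
  summarise(New_Customers = sum(X == "New Customer"),
            Repeat_Customers = sum(X == "Repeat Customer"),
            New_Customers_sales = sum(total_in_pennies * (X == "New Customer")),
            Repeat_Customers_sales = sum(total_in_pennies * (X == "Repeat Customer")))

Here the TRUE/FALSE comparisons coerce to 1/0, so each row's pennies are counted only when the label matches; either way, the labels must match the data exactly ("New Customer", not "New Customers").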
I have a dataframe (tibble) with multiple rows, each row contains an IDNR, a start date, an end date and an exposure status. The IDNR is a character variable, the start and end date are date variables and the exposure status is a numerical variable. This is what the top 3 rows look like:
# A tibble: 48,266 x 4
IDNR start end exposure
<chr> <date> <date> <dbl>
1 1 2018-02-15 2018-07-01 0
2 2 2017-10-30 2018-07-01 0
3 3 2016-02-11 2016-12-03 1
# ... with 48,256 more rows
In order to do a time-varying Cox regression, I want to split each row into 90-day parts while maintaining the start and end date. Here is an example of what I would like to achieve. The new end date is start + 90 days, and a new row is created whose start date is the end date of the previous row. If the remaining time between start and end is now less than 90 days, this is fine (as for IDNR 1 and 3); for IDNR 2, however, the remaining time still exceeds 90 days, so a third row needs to be added.
# A tibble: 48,266 x 4
# Groups: IDNR [33,240]
IDNR start end exposure
<chr> <date> <date> <dbl>
1 1 2018-02-15 2018-05-16 0
2 1 2018-05-16 2018-07-01 0
3 2 2017-10-30 2018-01-28 0
4 2 2018-01-28 2018-04-28 0
5 2 2018-04-28 2018-07-01 0
6 3 2016-02-11 2016-08-09 1
7 3 2016-08-09 2016-12-03 1
I'm relatively new to coding in R, but I've found dplyr to be very useful so far. So, if someone knows a solution using dplyr I would really appreciate that.
Thanks in advance!
Here you go:
Using df as your data frame:
df = data.frame(IDNR = 1:3,
                start = c("2018-02-15", "2017-10-30", "2016-02-11"),
                end = c("2018-07-01", "2018-07-01", "2016-12-03"),
                exposure = c(0, 0, 1))
Do:
library(lubridate)
newDF = apply(df, 1, function(x){
  # 90-day boundaries from start to end
  newStart = seq(from = ymd(x["start"]), to = ymd(x["end"]), by = 90)
  # each piece ends at the next boundary; the last piece ends at the original end
  newEnd = c(newStart[-1], ymd(x["end"]))
  data.frame(IDNR = rep(x["IDNR"], length(newStart)),
             start = newStart,
             end = newEnd,
             exposure = rep(x["exposure"], length(newStart)))
})
# bind the per-row data frames together, then drop zero-length intervals
newDF = do.call(rbind, newDF)
newDF = newDF[newDF$start != newDF$end,]
Result:
> newDF
IDNR start end exposure
1 1 2018-02-15 2018-05-16 0
2 1 2018-05-16 2018-07-01 0
3 2 2017-10-30 2018-01-28 0
4 2 2018-01-28 2018-04-28 0
5 2 2018-04-28 2018-07-01 0
6 3 2016-02-11 2016-05-11 1
7 3 2016-05-11 2016-08-09 1
8 3 2016-08-09 2016-11-07 1
9 3 2016-11-07 2016-12-03 1
What this does is create a sequence of days from start to end in 90-day steps and build a small data frame from each row, carrying along the IDNR and exposure. The apply returns a list of data frames that you can bind together using do.call. The last line removes rows that have the same start and end date.
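Since the question asked for dplyr: here is an untested tidyverse sketch of the same idea, building the 90-day boundaries with map2() and expanding them with unnest() (using the df defined above):

library(dplyr)
library(tidyr)
library(purrr)
library(lubridate)

newDF2 <- df %>%
  mutate(across(c(start, end), ymd),
         row_id = row_number()) %>%                    # one id per original interval
  mutate(boundary = map2(start, end, ~ seq(.x, .y, by = 90))) %>%
  unnest(boundary) %>%
  group_by(row_id) %>%
  mutate(new_start = boundary,
         new_end = coalesce(lead(boundary), end)) %>%  # next boundary, else original end
  ungroup() %>%
  filter(new_start != new_end) %>%                     # drop zero-length intervals
  select(IDNR, start = new_start, end = new_end, exposure)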
I was agonizing over how to phrase my question. I have a data frame of accounts and I want to create a new column that is a flag for whether there is another account that has a duplicate email within 30 days of that account.
I have a table like this.
AccountNumbers <- c(3748,8894,9923,4502,7283,8012,2938,7485,1010,9877)
EmailAddress <- c("John@gmail.com","John@gmail.com","Alex@outlook.com","Alan@yahoo.com","Stan@aol.com","Mary@outlook.com","Adam@outlook.com","Tom@aol.com","Jane@yahoo.com","John@gmail.com")
Dates <- c("2018-05-01","2018-05-05","2018-05-10","2018-05-15","2018-05-20",
"2018-05-25","2018-05-30","2018-06-01","2018-06-05","2018-06-10")
df <- data.frame(AccountNumbers,EmailAddress,Dates)
print(df)
AccountNumbers EmailAddress Dates
3748 John@gmail.com 2018-05-01
8894 John@gmail.com 2018-05-05
9923 Alex@outlook.com 2018-05-10
4502 Alan@yahoo.com 2018-05-15
7283 Stan@aol.com 2018-05-20
8012 Mary@outlook.com 2018-05-25
2938 Adam@outlook.com 2018-05-30
7485 Tom@aol.com 2018-06-01
1010 Jane@yahoo.com 2018-06-05
9877 John@gmail.com 2018-06-10
John@gmail.com appears three times. I want to flag the first two rows because they appear within 30 days of each other, but I don't want to flag the third.
AccountNumbers EmailAddress Dates DuplicateEmailFlag
3748 John@gmail.com 2018-05-01 1
8894 John@gmail.com 2018-05-05 1
9923 Alex@outlook.com 2018-05-10 0
4502 Alan@yahoo.com 2018-05-15 0
7283 Stan@aol.com 2018-05-20 0
8012 Mary@outlook.com 2018-05-25 0
2938 Adam@outlook.com 2018-05-30 0
7485 Tom@aol.com 2018-06-01 0
1010 Jane@yahoo.com 2018-06-05 0
9877 John@gmail.com 2018-06-10 0
I've been trying to use an ifelse() inside of mutate, but I don't know if it's possible to tell dplyr to only consider rows that are within 30 days of the row being considered.
Edit: To clarify, I want to look at the 30 days around each account. So that if I had a scenario where the same email address was being added exactly every 30 days, all of the occurrences of that email should be flagged.
This seems to work. First, I define the data frame.
AccountNumbers <- c(3748,8894,9923,4502,7283,8012,2938,7485,1010,9877)
EmailAddress <- c("John@gmail.com","John@gmail.com","Alex@outlook.com","Alan@yahoo.com","Stan@aol.com","Mary@outlook.com","Adam@outlook.com","Tom@aol.com","Jane@yahoo.com","John@gmail.com")
Dates <- c("2018-05-01","2018-05-05","2018-05-10","2018-05-15","2018-05-20",
"2018-05-25","2018-05-30","2018-06-01","2018-06-05","2018-06-10")
df <- data.frame(number = AccountNumbers, email = EmailAddress, date = as.Date(Dates))
Next, I group by email and check if there's an entry in the preceding or following 30 days. I also replace NAs (corresponding to cases with only one entry) with 0. Finally, I ungroup.
df %>%
group_by(email) %>%
mutate(dupe = coalesce(date - lag(date) < 30, (date - lead(date) < 30))) %>%
mutate(dupe = replace_na(dupe, 0)) %>%
ungroup
This gives,
# # A tibble: 10 x 4
# number email date dupe
# <dbl> <fct> <date> <dbl>
# 1 3748 John@gmail.com 2018-05-01 1
# 2 8894 John@gmail.com 2018-05-05 1
# 3 9923 Alex@outlook.com 2018-05-10 0
# 4 4502 Alan@yahoo.com 2018-05-15 0
# 5 7283 Stan@aol.com 2018-05-20 0
# 6 8012 Mary@outlook.com 2018-05-25 0
# 7 2938 Adam@outlook.com 2018-05-30 0
# 8 7485 Tom@aol.com 2018-06-01 0
# 9 1010 Jane@yahoo.com 2018-06-05 0
# 10 9877 John@gmail.com 2018-06-10 0
as required.
Edit: This makes the implicit assumption that your data are sorted by date. If not, you'd need to add an extra step to do so.
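Making that ordering explicit is one extra line; a sketch:

df %>%
  arrange(email, date) %>%   # ensure dates are sorted before lag()/lead()
  group_by(email) %>%
  mutate(dupe = coalesce(date - lag(date) < 30, date - lead(date) < 30)) %>%
  mutate(dupe = replace_na(as.numeric(dupe), 0)) %>%
  ungroup()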
I think this gets at what you want:
library(dplyr)

df %>%
  group_by(EmailAddress) %>%
  mutate(helper = cumsum(coalesce(if_else(difftime(Dates, lag(Dates), units = 'days') <= 30, 0, 1), 0))) %>%
  group_by(EmailAddress, helper) %>%
  mutate(DuplicateEmailFlag = (n() >= 2)*1) %>%
  ungroup() %>%
  select(-helper)
# A tibble: 10 x 4
AccountNumbers EmailAddress Dates DuplicateEmailFlag
<dbl> <chr> <date> <dbl>
1 3748 John@gmail.com 2018-05-01 1
2 8894 John@gmail.com 2018-05-05 1
3 9923 Alex@outlook.com 2018-05-10 0
4 4502 Alan@yahoo.com 2018-05-15 0
5 7283 Stan@aol.com 2018-05-20 0
6 8012 Mary@outlook.com 2018-05-25 0
7 2938 Adam@outlook.com 2018-05-30 0
8 7485 Tom@aol.com 2018-06-01 0
9 1010 Jane@yahoo.com 2018-06-05 0
10 9877 John@gmail.com 2018-06-10 0
Note:
I think @Lyngbakr's solution is better for the circumstances in your question. Mine would be more appropriate if the size of the duplicate group might change (e.g., you want to check for 3 or 4 entries within 30 days of each other, rather than 2), as in the sketch below.
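For instance, requiring at least three entries per window is a one-line change (sketch):

mutate(DuplicateEmailFlag = (n() >= 3)*1)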
Slightly modified data (Dates converted to Date, strings kept as characters):
AccountNumbers <- c(3748,8894,9923,4502,7283,8012,2938,7485,1010,9877)
EmailAddress <- c("John@gmail.com","John@gmail.com","Alex@outlook.com","Alan@yahoo.com","Stan@aol.com","Mary@outlook.com","Adam@outlook.com","Tom@aol.com","Jane@yahoo.com","John@gmail.com")
Dates <- as.Date(c("2018-05-01","2018-05-05","2018-05-10","2018-05-15","2018-05-20",
"2018-05-25","2018-05-30","2018-06-01","2018-06-05","2018-06-10"))
df <- data.frame(AccountNumbers,EmailAddress,Dates, stringsAsFactors = FALSE)
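If, per the edit to the question, the 30-day window must be evaluated around every row (so that a chain of emails added exactly 30 days apart are all flagged), a brute-force per-row check is a simple, if slower, alternative. An untested sketch using the modified data above:

library(dplyr)
library(purrr)

df %>%
  group_by(EmailAddress) %>%
  mutate(DuplicateEmailFlag = as.integer(map_lgl(
    Dates,
    # flag a row if any other row with this email is within 30 days either side
    ~ sum(abs(as.numeric(difftime(Dates, .x, units = "days"))) <= 30) > 1
  ))) %>%
  ungroup()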
I have repeated measurements on individuals who, when solicited, either made a donation or did not. I would like to carry the date of the last successful solicitation over to the following observations, until a new success occurs.
Here is my sample data:
set.seed(13)
df <- data.frame(ID=rep(letters[1:3], each=4),
SolicitationDate= sample(seq(as.Date('2016/01/01'),
as.Date('2018/01/01'), by="day"), 3),
Success=rbinom(4,1,0.2))
df$ExpectedResult <- c(NA, NA, "2016-06-28", "2016-06-28",
NA, NA, "2016-10-11", "2016-10-11",
NA,NA,"2017-06-03", "2017-06-03")
Should an individual have multiple successes, the last success date should be carried over.
Thanks
Romain
Here's a version using the tidyverse. I think your expected output may be off, as the dates should be ordered within ID, but I may be wrong about that. If so, let me know.
library(dplyr)

df %>%
  group_by(ID) %>%                                   # Group by ID
  arrange(SolicitationDate, .by_group = TRUE) %>%    # Sort by date within each ID
  mutate(res = replace(SolicitationDate, Success == 0, NA)) %>% # Keep only success dates
  tidyr::fill(res)                                   # Fill down within the group
This will give you
# A tibble: 12 x 4
# Groups: ID [3]
ID SolicitationDate Success res
<fct> <date> <int> <date>
1 a 2016-06-28 1 2016-06-28
2 a 2016-10-11 0 2016-06-28
3 a 2017-06-03 0 2016-06-28
4 a 2017-06-03 0 2016-06-28
5 b 2016-06-28 0 NA
6 b 2016-06-28 0 NA
7 b 2016-10-11 1 2016-10-11
8 b 2017-06-03 0 2016-10-11
9 c 2016-06-28 0 NA
10 c 2016-10-11 0 NA
11 c 2016-10-11 0 NA
12 c 2017-06-03 1 2017-06-03
I'm not sure if you want the success dates to be part of the result or not. If not, you could shift before filling, as sketched below. In any case: hope this helps.
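That second pass could look like this untested sketch: shift the success dates down one row before filling, so each success date only shows up on the rows after it:

df %>%
  group_by(ID) %>%
  arrange(SolicitationDate, .by_group = TRUE) %>%
  mutate(res = replace(SolicitationDate, Success == 0, NA)) %>%
  mutate(res = lag(res)) %>%   # the success row itself no longer carries its own date
  tidyr::fill(res)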
I have the following data frame in R:
Date Car_NO
2016-12-24 19:35:00 ABC
2016-12-24 19:55:00 DEF
2016-12-24 20:15:00 RTY
2016-12-24 20:35:00 WER
2016-12-24 21:34:00 DER
2016-12-24 00:23:00 ABC
2016-12-24 00:22:00 ERT
2016-12-24 11:45:00 RTY
2016-12-24 13:09:00 RTY
The class of the Date column is "POSIXct" "POSIXt".
I want to count hourly car traffic, in slots like 12-1, 1-2, 2-3, 3-4, and so on.
Currently my approach is the following:
df$time <- ymd_hms(df$Date)
df$hours <- hour(df$time)
df$minutes <- minute(df$time)
df$time <- as.numeric(paste(df$hours,df$minutes,sep="."))
After this I would apply a chain of ifelse() calls to divide it into hourly time slots, but I think that would be a long and tedious way to do it. Is there an easier approach in R?
My desired dataframe would be
Time_Slots Car_Traffic_count
00-01 2
01-02 0
02-03 0
.
.
.
19-20 2
20-21 2
21-22 1
.
.
.
Simplest would be to just use the starting hour to indicate a time interval:
# sample data
df = data.frame(time = Sys.time() + seq(1, 10) * 10000, value = runif(10))

# summarize
library(dplyr)
library(tidyr)

df$hour = factor(as.numeric(format(df$time, "%H")), levels = 0:23)
df = df %>%
  group_by(hour) %>%
  summarize(count = n()) %>%
  complete(hour, fill = list(count = 0))
Output:
# A tibble: 24 x 2
hour count
<fctr> <dbl>
1 0 0
2 1 1
3 2 0
4 3 0
5 4 1
6 5 0
7 6 1
8 7 0
9 8 0
10 9 1
# ... with 14 more rows
You can optionally add:
df$formatted = paste0(as.character(df$hour),"-",as.numeric(as.character(df$hour))+1)
at the end to get your desired format. Hope this helps!
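Starting from the same sample data, an alternative sketch (untested) that leans on lubridate and the .drop argument of count() to keep the empty hours:

library(dplyr)
library(lubridate)

df %>%
  mutate(hour = factor(hour(time), levels = 0:23)) %>%
  count(hour, .drop = FALSE, name = "count")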
I have a data table of three columns id, dtstart, dtend. For example:
id start end
1 01/01/2015 31/01/2015
1 02/02/2015 28/02/2015
1 01/07/2016 31/07/2016
1 01/08/2016 31/08/2016
2 01/03/2015 31/03/2015
2 01/04/2015 30/04/2015
2 01/02/2016 28/02/2016
2 01/03/2016 31/03/2016
...
I need to create another data table, grouped by id, with the same columns, where the new start date is the minimum of the original start dates and the new end date is the maximum of the original end dates.
When there is a break of more than one day between an end date and the next start date, those rows should be grouped separately.
For example for the above the new table would be:
id start end
1 01/01/2015 28/02/2015
1 01/07/2016 31/08/2016
2 01/03/2015 30/04/2015
2 01/02/2016 31/03/2016
...
Do I need a for loop or is there a more efficient way (data table grouping for example)? The table is over 20 million rows with 100k+ unique ids.
Cheers
Andrew
This can be done using dplyr:
library(dplyr)

dt.new <- dt %>%
  arrange(id, start, end) %>%
  mutate(gr = cumsum(lag(id, default = min(id)) != id |
                     as.numeric(difftime(start, lag(end, default = first(start)), units = 'days')) > 1)) %>%
  group_by(id, gr) %>%
  summarise(start = first(start),
            end = last(end))
The result is:
Source: local data frame [6 x 4]
Groups: id [?]
id gr start end
<int> <int> <dttm> <dttm>
1 1 0 2015-01-01 2015-01-31
2 1 1 2015-02-02 2015-02-28
3 1 2 2016-07-01 2016-08-31
4 2 3 2015-03-01 2015-04-30
5 2 4 2016-02-01 2016-02-28
6 2 5 2016-03-01 2016-03-31
This works but doesn't match your expected output, for two reasons: you requested a one-day margin (if you want a two-day margin, switch from > 1 to > 2), and 2016 was a leap year in R's internal calendar, so the margin between 28/02/2016 and 01/03/2016 is 2 days.
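Since the question mentioned data.table and the table has 20M+ rows, here is an untested data.table sketch of the same grouping idea (assuming columns id, start, end as in the dplyr version, with the dates stored as Date):

library(data.table)

setDT(dt)
setorder(dt, id, start, end)
dt[, gr := cumsum(id != shift(id, fill = id[1]) |
                  as.numeric(start - shift(end, fill = start[1]), units = "days") > 1)]
dt.new <- dt[, .(start = min(start), end = max(end)), by = .(id, gr)]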
Thanks again @akash87.
For example, row 6 below falls within a period that is already covered, so id 1 should still return one row from 01/02/2006 to 30/09/2006, but instead it breaks into two: the first from 01/02/2006 to 12/06/2006 and the second from 01/07/2006 to 30/09/2006.
id dtstart dtend
1 01/02/2006 28/02/2006
1 01/03/2006 31/03/2006
1 01/04/2006 30/04/2006
1 01/05/2006 31/05/2006
1 01/06/2006 30/06/2006
1 10/06/2006 12/06/2006
1 01/07/2006 31/07/2006
1 01/08/2006 31/08/2006
1 01/09/2006 30/09/2006
2 01/04/2006 30/04/2006
2 01/05/2006 31/05/2006
2 01/09/2006 30/09/2006
2 01/10/2006 31/10/2006
So instead of returning
id start end
1 01/02/2006 30/09/2006
2 01/04/2006 31/05/2006
2 01/09/2006 31/10/2006
We have
id start end
1 01/02/2006 12/06/2006
1 01/07/2006 30/09/2006
2 01/04/2006 31/05/2006
2 01/09/2006 31/10/2006
Andrew
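One way to handle an interval that is contained in an earlier one, like row 6, is to compare each start against the running maximum of all end dates seen so far in the group, rather than only the immediately preceding end. An untested sketch built on the earlier dplyr answer (assuming dtstart and dtend are Date columns):

library(dplyr)

dt.new <- dt %>%
  arrange(id, dtstart, dtend) %>%
  group_by(id) %>%
  # start a new group only when the gap to the furthest end seen so far exceeds 1 day
  mutate(gr = cumsum(as.numeric(dtstart) -
                     lag(cummax(as.numeric(dtend)), default = -Inf) > 1)) %>%
  group_by(id, gr) %>%
  summarise(start = min(dtstart), end = max(dtend)) %>%
  ungroup() %>%
  select(-gr)

With the sample above, row 6 then stays inside the 01/02/2006 to 30/09/2006 block for id 1.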