Removing duplicates based on 3 columns in R

I have a data set of 300k+ cases where a customer id may be repeated several times. Each row also has a date and a rank associated with it. I'd like to keep only one row per unique customer id: the row with the latest date, and if a duplicate id also has a duplicate date, the one with the lower rank (keeping the rank closest to 1). An example of my data is like this:
Customer.ID Date Rank
576293 8/13/2012 2
576293 11/16/2015 6
581252 11/22/2013 4
581252 11/16/2011 6
581252 1/4/2016 5
581600 1/12/2015 3
581600 1/12/2015 2
582560 4/13/2016 1
591674 3/21/2012 6
586334 3/30/2014 1
Ideal outcome would then be like this:
Customer.ID Date Rank
576293 11/16/2015 6
581252 1/4/2016 5
581600 1/12/2015 2
582560 4/13/2016 1
591674 3/21/2012 6
586334 3/30/2014 1

With the desired output of the OP clarified:
We can also do this with base R, which will be faster than the dplyr approach below using group_by(Customer.ID), since it does not have to loop over every unique Customer.ID:
# sort: Customer.ID ascending, Date descending, Rank ascending
df <- df[order(-df$Customer.ID, as.Date(df$Date, format = "%m/%d/%Y"), -df$Rank, decreasing = TRUE), ]
# keep only the first (latest-date, lowest-rank) row per Customer.ID
res <- df[!duplicated(df$Customer.ID), ]
Notes:
First, sort by Customer.ID in ascending order followed by Date in descending order followed by Rank in ascending order.
Remove the duplicates in Customer.ID so that only the first row for each Customer.ID is kept.
The result using your posted data as a data frame df (without converting the Date column) in ascending order for Customer.ID:
print(res)
## Customer.ID Date Rank
##2 576293 11/16/2015 6
##5 581252 1/4/2016 5
##7 581600 1/12/2015 2
##8 582560 4/13/2016 1
##10 586334 3/30/2014 1
##9 591674 3/21/2012 6
Data:
df <- structure(list(Customer.ID = c(591674L, 586334L, 582560L, 581600L,
581252L, 576293L), Date = c("3/21/2012", "3/30/2014", "4/13/2016",
"1/12/2015", "1/4/2016", "11/16/2015"), Rank = c(6L, 1L, 1L,
2L, 5L, 6L)), .Names = c("Customer.ID", "Date", "Rank"), row.names = c(9L,
10L, 8L, 7L, 5L, 2L), class = "data.frame")
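For comparison, the same sort-then-dedup idea can be sketched with data.table (an alternative, not part of the original answer; it assumes the character Date column and uses data.table's as.IDate for the conversion):
library(data.table)
# order: Customer.ID ascending, Date descending, Rank ascending; then first row per id
setDT(df)[order(Customer.ID, -as.IDate(Date, "%m/%d/%Y"), Rank), .SD[1], by = Customer.ID]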
If you want to keep only the latest date (followed by lower rank) row for each Customer.ID, you can do the following using dplyr:
library(dplyr)
res <- df %>% group_by(Customer.ID) %>% arrange(desc(Date),Rank) %>%
summarise_all(funs(first)) %>%
ungroup() %>% arrange(Customer.ID)
Notes:
group_by Customer.ID and sort using arrange by Date in descending order and Rank by ascending order.
summarise_all to keep only the first row from each Customer.ID.
Finally, ungroup and sort by Customer.ID to get your desired result.
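As an aside: funs() was soft-deprecated in dplyr 0.8.0, so on a recent install the same idea can be sketched with across() instead (assuming dplyr >= 1.0; the result should match the output below):
library(dplyr)
res <- df %>%
  arrange(desc(Date), Rank) %>%                # latest date first, then lowest rank
  group_by(Customer.ID) %>%
  summarise(across(everything(), first)) %>%   # keep the first row per customer
  arrange(Customer.ID)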
Using your data as a data frame df with the Date column converted to the Date class:
print(res)
## A tibble: 6 x 3
## Customer.ID Date Rank
## <int> <date> <int>
##1 576293 2015-11-16 6
##2 581252 2016-01-04 5
##3 581600 2015-01-12 2
##4 582560 2016-04-13 1
##5 586334 2014-03-30 1
##6 591674 2012-03-21 6
Data:
df <- structure(list(Customer.ID = c(576293L, 576293L, 581252L, 581252L,
581252L, 581600L, 581600L, 582560L, 591674L, 586334L), Date = structure(c(15565,
16755, 16031, 15294, 16804, 16447, 16447, 16904, 15420, 16159
), class = "Date"), Rank = c(2L, 6L, 4L, 6L, 5L, 3L, 2L, 1L,
6L, 1L)), .Names = c("Customer.ID", "Date", "Rank"), row.names = c(NA,
-10L), class = "data.frame")

Related

Filter values relative to values in another column using dplyr

I have a column in a dataframe and I would like to filter out any rows that are over or under two standard deviations from the mean.
As an example, I would hope to get two rows out of this (only the rows that fall between the low and high standard deviations):
group value low_sd high_sd
a 4 2 8
a 1 2 8
b 6 4 9
b 12 4 9
I was hoping to use dplyr::between:
clean_df <- df%>%
filter(between(value, low_sd, high_sd))
But it seems between only takes numerical values.
dplyr's between() is not vectorized over its left and right arguments. Instead, this can be done with plain comparison operators (>= and <=, matching between()'s inclusive bounds):
library(dplyr)
df %>%
filter(value >= low_sd, value <= high_sd)
# group value low_sd high_sd
#1 a 4 2 8
#2 b 6 4 9
But if we wrap between() with Vectorize(), it works as well:
df %>%
filter(Vectorize(dplyr::between)(value, low_sd, high_sd))
# group value low_sd high_sd
#1 a 4 2 8
#2 b 6 4 9
data
df <- structure(list(group = c("a", "a", "b", "b"), value = c(4L, 1L,
6L, 12L), low_sd = c(2L, 2L, 4L, 4L), high_sd = c(8L, 8L, 9L,
9L)), class = "data.frame", row.names = c(NA, -4L))
Alternatively, you can use between() from data.table:
df %>%
filter(data.table::between(value, low_sd, high_sd))
group value low_sd high_sd
1 a 4 2 8
2 b 6 4 9
Or if you want to stick just to dplyr:
df %>%
rowwise() %>%
filter(dplyr::between(value, low_sd, high_sd))
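One more note: newer dplyr releases (1.1.0 and later, if I recall the version correctly) allow between() to take vector left and right bounds, so on a current install the original attempt should work as written:
library(dplyr)
df %>%
  filter(between(value, low_sd, high_sd))  # vector bounds of the same length as value are accepted in recent dplyr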

Insert row to fill in missing date in R [duplicate]

I have a dataset that look something like this:
Person date Amount
A 2019-01 900
A 2019-03 600
A 2019-04 300
A 2019-05 0
B 2019-04 1200
B 2019-07 800
B 2019-08 400
B 2019-09 0
As you'll notice in the "date" column, there are missing dates, such as '2019-02' for person A and '2019-05' and '2019-06' for person B. I would like to insert rows with the missing date and amount equal to the one before it (see expected result below).
I have tried performing group by but I don't know how to proceed from there. I've also tried converting the 'date' and 'amount' columns as lists, and from there fill in the gaps before putting them back to the dataframe. I was wondering if there is a more convenient way of doing this. In particular, getting the same results without having to extract lists from the original dataframe.
Ideally, I would want to have a dataframe that looks something like this:
Person date Amount
A 2019-01 900
A 2019-02 900
A 2019-03 600
A 2019-04 300
A 2019-05 0
B 2019-04 1200
B 2019-05 1200
B 2019-06 1200
B 2019-07 800
B 2019-08 400
B 2019-09 0
I hope I was able to make my problem clear.
Thanks in advance.
We can first convert date to an actual date object (date1) by pasting "-01" onto the end. Then, using complete, we create a sequence of one-month date objects for each Person. We then use fill to carry the previous Amount forward into the new rows, and to return the data to its original form we strip the "-01" from date1 again.
library(dplyr)
library(tidyr)
df %>%
mutate(date1 = as.Date(paste0(date, "-01"))) %>%
group_by(Person) %>%
complete(date1 = seq(min(date1), max(date1), by = "1 month")) %>%
fill(Amount) %>%
mutate(date = sub("-01$", "", date1)) %>%
select(-date1)
# Person date Amount
# <fct> <chr> <int>
# 1 A 2019-01 900
# 2 A 2019-02 900
# 3 A 2019-03 600
# 4 A 2019-04 300
# 5 A 2019-05 0
# 6 B 2019-04 1200
# 7 B 2019-05 1200
# 8 B 2019-06 1200
# 9 B 2019-07 800
#10 B 2019-08 400
#11 B 2019-09 0
data
df <- structure(list(Person = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L), .Label = c("A", "B"), class = "factor"), date = structure(c(1L,
2L, 3L, 4L, 3L, 5L, 6L, 7L), .Label = c("2019-01", "2019-03",
"2019-04", "2019-05", "2019-07", "2019-08", "2019-09"), class = "factor"),
Amount = c(900L, 600L, 300L, 0L, 1200L, 800L, 400L, 0L)),
class = "data.frame", row.names = c(NA, -8L))

Aggregate/Sum data set by week and by product in R

I have a very large data set that I would like to aggregate both by week/month and by product (a few thousand of them). Is there a way to do so with a data set in the following format?
Date product product2 product3
03/03/2011 1 0 7
04/08/2011 3 8 2
03/05/2015 6 3 89
03/01/2017 1 0 2
03/03/2017 6 1 6
which would yield the following:
Date product product2 product3
wk1-032011 1 0 7
wk2-042011 3 8 2
wk1-032015 6 3 89
wk1-032017 7 1 8
df <- structure(list(Date = c("03/03/2011", "04/04/2011", "03/05/2015", "03/01/2017", "03/03/2017"),
product= c(1L, 3L, 6L, 1L, 6L),
product2= c(0L, 8L, 3L, 0L, 1L),
product3= c(7L, 2L, 89L, 2L, 6L)),
.Names= c("Date", "product", "product2", "product3"),
class= "data.frame", row.names=c(NA, -5L))
In base R, you can use as.Date to convert your character df$Date into a Date variable, then use format() with the appropriate format string to convert the date into a character variable indicating the week. aggregate() is then used to perform the aggregation by the new variable.
aggregate(df[2:4], list("weeks"=format(as.Date(df$Date, "%m/%d/%Y"), "%Y-%W")), FUN=sum)
weeks product product2 product3
1 2011-09 1 0 7
2 2011-14 3 8 2
3 2015-09 6 3 89
4 2017-09 7 1 8
See ?strptime for other date conversions.
As #akrun mentions in the comments, the data.table analog to the above base R code is
library(data.table)
setDT(df)[, lapply(.SD, sum),
by=.(weeks = format(as.IDate(Date, "%m/%d/%Y"), "%Y-%W"))]
Here, setDT converts the data.frame into a data.table, and lapply(.SD, sum) calculates the sum of each column, where .SD stands for the Subset of Data belonging to each group. The sum is calculated for each unique value produced by format(as.IDate(Date, "%m/%d/%Y"), "%Y-%W"), where the conversion uses data.table's as.IDate in place of base R's as.Date.
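The question also asks about monthly totals; swapping the format string for "%Y-%m" gives those directly (a sketch on the same df):
# monthly aggregation: "%Y-%m" yields e.g. "2011-03"
aggregate(df[2:4], list(months = format(as.Date(df$Date, "%m/%d/%Y"), "%Y-%m")), FUN = sum)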

summarize in dplyr with the maximum value of the date - R

I have the following data,
data
date ID value1 value2
2016-04-03 1 0 1
2016-04-10 1 6 2
2016-04-17 1 7 3
2016-04-24 1 2 4
2016-04-03 2 1 5
2016-04-10 2 5 6
2016-04-17 2 9 7
2016-04-24 2 4 8
Now I want to group by ID and find the mean of value2 and the latest value of value1. By latest I mean the value on the most recent date, i.e. here I would like to get the value1 corresponding to 2016-04-24 for each ID. My output should be like,
ID max_value1 mean_value2
1 2 2.5
2 4 6.5
The following is the command I am using,
data %>% group_by(ID) %>% summarize(mean_value2 = mean(value2))
But I am not sure how to do the first one. Can anybody help me in getting the latest value of value1 while summarizing in dplyr?
One way would be the following. My assumption here is that date is a Date object. You first order the rows by date within each ID using arrange. Then, you group the data by ID. In summarize, you can use last() to take the last value1 for each ID.
arrange(data,ID,date) %>%
group_by(ID) %>%
summarize(mean_value2 = mean(value2), max_value1 = last(value1))
# ID mean_value2 max_value1
# <int> <dbl> <int>
#1 1 2.5 2
#2 2 6.5 4
DATA
data <- structure(list(date = structure(c(16894, 16901, 16908, 16915,
16894, 16901, 16908, 16915), class = "Date"), ID = c(1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L), value1 = c(0L, 6L, 7L, 2L, 1L, 5L, 9L,
4L), value2 = 1:8), .Names = c("date", "ID", "value1", "value2"
), row.names = c(NA, -8L), class = "data.frame")
Here is an option with data.table
library(data.table)
setDT(data)[, .(max_value1 = value1[which.max(date)],
mean_value2 = mean(value2)) , by = ID]
# ID max_value1 mean_value2
#1: 1 2 2.5
#2: 2 4 6.5
You can do this using the function nth in dplyr which finds the nth value of a vector.
data %>% group_by(ID) %>%
summarize(max_value1 = nth(value1, n = length(value1)), mean_value2 = mean(value2))
This is based on the assumption that the data is ordered by date as in the example; otherwise use arrange as discussed above.
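For completeness, the which.max() idea from the data.table answer also translates directly to dplyr, which avoids relying on row order at all (a sketch on the same data):
library(dplyr)
data %>%
  group_by(ID) %>%
  summarize(max_value1 = value1[which.max(date)],  # value1 at the latest date
            mean_value2 = mean(value2))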

Cohort Data transformation

I have the following data:
signup_date purchase_date nbr_purchase
2010-12-12 7 2
2011-01-03 4 1
2010-11-28 6 2
2011-01-05 19 9
2010-11-10 26 3
2010-11-25 11 2
Here each row corresponds to a customer, signup_date is the sign-up date, purchase_date is the number of days elapsed between sign-up and first purchase, and nbr_purchase is the number of items purchased. I would like to carry out a cohort analysis and transform the data to look like:
cohort signed_up active_m0 active_m1 active_m2
2011-10 12345 10432 8765 6754
2011-11 12345 10432 8765 6754
2011-12 12345 10432 8765 6754
Cohort here is in “YYYY-MM” format, signed_up is the number of users who have created accounts in the given month, active_m0 – number of users who made first purchase in the same month as they registered, active_m1 – number of users who made first purchase in the following month, and so forth.
Assuming your input data is in the following format
dd<-structure(list(signup_date = structure(c(14955, 14977, 14941,
14979, 14923, 14938), class = "Date"), purchase_date = c(7L,
4L, 6L, 19L, 26L, 11L), nbr_purchase = c(2L, 1L, 2L, 9L, 3L,
2L)), .Names = c("signup_date", "purchase_date", "nbr_purchase"
), row.names = c(NA, -6L), class = "data.frame")
Then you can do
dd$cohort <- strftime(dd$signup_date, "%Y-%m")  # signup month, "YYYY-MM"
dd$interval <- paste0("active_m", (dd$purchase_date %/% 10) + 1)  # 10-day bucket
tt <- with(dd, table(cohort, interval))
cbind(tt, signed_up = rowSums(tt))
to get the data you need
active_m1 active_m2 active_m3 signed_up
2010-11 1 1 1 3
2010-12 1 0 0 1
2011-01 1 1 0 2
Note that here I used 10-day intervals rather than 30-day ones, since you didn't have any purchase observations that were more than 30 days from signup. So for your real data, change %/% 10 to %/% 30. The labels above also start at m1; see the m0-based tweak below.
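If you want the m0-based column names from your desired output (first purchase within the first 30 days counted as active_m0), a hedged tweak is to drop the +1 and use 30-day buckets (on the sample data everything then lands in active_m0, since all purchases are within 26 days of signup):
dd$interval <- paste0("active_m", dd$purchase_date %/% 30)  # 0-29 days -> m0, 30-59 -> m1, ...
tt <- with(dd, table(cohort, interval))
cbind(tt, signed_up = rowSums(tt))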
