How to count the number of customers per month in R?

So I have a table of customers with the respective date as below:
ID   Date
1    2019-04-17
4    2019-05-12
1    2019-04-25
2    2019-05-19
I just want to count how many customers there are for each month-year, like below:
Month-Year   Count of Customer
Apr-19       2
May-19       2
EDIT:
Sorry, I think my question should be clearer.
The same customer can appear more than once in a month and would then be counted as 2 customers for that month. I would basically like to find the number of transactions per month based on customer ID.
My assumed approach would be to first change the date into a month-year format and then count the customers grouped by month, but I am not sure how to do this in R. Thank you!

You can use count -
library(dplyr)
df %>% count(Month_Year = format(as.Date(Date), '%b-%y'))
# Month_Year n
#1 Apr-19 2
#2 May-19 2
Or table in base R -
table(format(as.Date(df$Date), '%b-%y'))
#Apr-19 May-19
# 2 2
data
df <- structure(list(ID = c(1L, 4L, 1L, 2L), Date = c("2019-04-17",
"2019-05-12", "2019-04-25", "2019-05-19")),
class = "data.frame", row.names = c(NA, -4L))
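Given the clarification in the question's edit, it may help to see the two possible counts side by side. This is only a minimal base R sketch, assuming the same data as above; the Month column and the names transactions and customers are introduced here purely for illustration. It uses a "%Y-%m" key so that months sort chronologically rather than alphabetically.

```r
# Same sample data as in the answer above
df <- data.frame(
  ID   = c(1L, 4L, 1L, 2L),
  Date = c("2019-04-17", "2019-05-12", "2019-04-25", "2019-05-19")
)

# Year-month key (illustrative column name); sorts chronologically as text
df$Month <- format(as.Date(df$Date), "%Y-%m")

# Rows per month: a customer appearing twice is counted twice
transactions <- table(df$Month)

# Distinct customer IDs per month, for comparison
customers <- tapply(df$ID, df$Month, function(x) length(unique(x)))

print(transactions)  # 2019-04: 2, 2019-05: 2
print(customers)     # 2019-04: 1, 2019-05: 2
```

The first count matches what the question asks for after the edit; the second is shown only to make the difference visible.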

We can use zoo::as.yearmon
library(dplyr)
df %>%
count(Date = zoo::as.yearmon(Date))
Date n
1 Apr 2019 2
2 May 2019 2
data
df <- structure(list(ID = c(1L, 4L, 1L, 2L), Date = c("2019-04-17",
"2019-05-12", "2019-04-25", "2019-05-19")),
class = "data.frame", row.names = c(NA, -4L))

Related

In R, make a conditional indicator variable based on (a) the first instance of a record type and (b) a date difference

Background
Here's a df with some data in it from a Costco-like members-only big-box store:
d <- data.frame(ID = c("a","a","b","c","c","d"),
purchase_type = c("grocery","grocery",NA,"auto","grocery",NA),
date_joined = as.Date(c("2014-01-01","2014-01-01","2013-04-30","2009-03-08","2009-03-08","2015-03-04")),
date_purchase = as.Date(c("2014-04-30","2016-07-08","2013-06-29","2015-04-07","2017-09-10","2017-03-10")),
stringsAsFactors=T)
library(dplyr)
d <- d %>%
mutate(date_diff = date_purchase - date_joined)
This yields the following table:
As you can see, it's got a member ID, purchase types based on the broad category of what people bought, and two dates: the date the member originally became a member, and the date of a given purchase. I've also made a variable date_diff to tally the time between a given purchase and the beginning of membership.
The Problem
I'd like to make a new variable early_shopper that's marked 1 on all of a member's purchases if:
(a) that member's first purchase was made within a year of joining (so date_diff <= 365 days), and
(b) this first purchase doesn't have an NA in purchase_type.
If these criteria aren't met, give a 0.
What I'm looking for is a table that looks like this:
Note that Member a is the only "true" early_shopper: their first purchase is non-NA in purchase_type, and only 119 days passed between their joining the store and making a purchase there. Member b looks like they could be based on my date_diff criterion, but since they don't have a non-NA value in purchase_type, they don't count as an early_shopper.
What I've Tried
So far, I've tried using mutate and first functions like this:
d <- d %>%
mutate(early_shopper = if_else(!is.na(first(purchase_type,order_by = date_joined)) & date_diff < 365, 1, 0))
Which gives me this:
Something's kinda working here, but not fully. As you can see, I get the correct early_shopper = 1 in Member a's first purchase, but not their second. I also get a false positive with member b, who's marked as an early_shopper when I don't want them to be (because their purchase_type is NA).
Any ideas? I can further clarify if need be. Thanks!
You could use
library(dplyr)
d %>%
mutate(date_diff = date_purchase - date_joined) %>%
group_by(ID, purchase_type) %>%
arrange(ID, date_joined) %>%
mutate(
early_shopper = +(!is.na(first(purchase_type)) & date_diff <= 365)
) %>%
group_by(ID) %>%
mutate(early_shopper = max(early_shopper)) %>%
ungroup()
which returns
# A tibble: 6 x 6
ID purchase_type date_joined date_purchase date_diff early_shopper
<fct> <fct> <date> <date> <drtn> <int>
1 a grocery 2014-01-01 2014-04-30 119 days 1
2 a grocery 2014-01-01 2016-07-08 919 days 1
3 b NA 2013-04-30 2013-06-29 60 days 0
4 c auto 2009-03-08 2015-04-07 2221 days 0
5 c grocery 2009-03-08 2017-09-10 3108 days 0
6 d NA 2015-03-04 2017-03-10 737 days 0
If you want the early_shopper column to be boolean/logical, just remove the +.
Data
I used this data. Here the date_joined for b is 2013-04-30, as shown in your images, and not as in the data you actually posted.
structure(list(ID = structure(c(1L, 1L, 2L, 3L, 3L, 4L), .Label = c("a",
"b", "c", "d"), class = "factor"), purchase_type = structure(c(2L,
2L, NA, 1L, 2L, NA), .Label = c("auto", "grocery"), class = "factor"),
date_joined = structure(c(16071, 16071, 15825, 14311, 14311,
16498), class = "Date"), date_purchase = structure(c(16190,
16990, 15885, 16532, 17419, 17235), class = "Date")), class = "data.frame", row.names = c(NA,
-6L))
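For comparison, here is a base R sketch that applies the two criteria from the question literally: it looks only at each member's earliest purchase and then copies the result to all of that member's rows. It is an illustration under stated assumptions, not a replacement for the pipeline above; the helper name flag_member is hypothetical, and the data is the sample from the question rebuilt with character columns.

```r
d <- data.frame(
  ID = c("a", "a", "b", "c", "c", "d"),
  purchase_type = c("grocery", "grocery", NA, "auto", "grocery", NA),
  date_joined = as.Date(c("2014-01-01", "2014-01-01", "2013-04-30",
                          "2009-03-08", "2009-03-08", "2015-03-04")),
  date_purchase = as.Date(c("2014-04-30", "2016-07-08", "2013-06-29",
                            "2015-04-07", "2017-09-10", "2017-03-10"))
)

# flag_member (hypothetical helper): idx holds the row numbers of one member
flag_member <- function(idx) {
  # row of this member's earliest purchase
  first <- idx[which.min(as.numeric(d$date_purchase[idx]))]
  days <- as.numeric(d$date_purchase[first] - d$date_joined[first], units = "days")
  # 1 only if the first purchase has a type and happened within 365 days
  flag <- as.integer(!is.na(d$purchase_type[first]) && days <= 365)
  rep(flag, length(idx))
}

# ave() applies the helper per ID and puts the results back in row order
d$early_shopper <- ave(seq_len(nrow(d)), d$ID, FUN = flag_member)
print(d$early_shopper)  # 1 1 0 0 0 0
```

Because the check is tied to the earliest purchase only, a later non-NA purchase within the first year would not turn a member into an early shopper, which is how the criteria in the question are worded.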
Here is my approach using a join to get the early_shopper value to be the same for all rows of the same ID.
library(dplyr)
d <- structure(list(ID = structure(c(1L, 1L, 2L, 3L, 3L, 4L),
.Label = c("a","b", "c", "d"),
class = "factor"),
purchase_type = structure(c(2L, 2L, NA, 1L, 2L, NA),
.Label = c("auto", "grocery"),
class = "factor"),
date_joined = structure(c(16071, 16071, 15825, 14311, 14311, 16498),
class = "Date"),
date_purchase = structure(c(16190, 16990, 15885, 16532, 17419, 17235),
class = "Date")),
class = "data.frame", row.names = c(NA, -6L))
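d %>%
inner_join(d %>%
mutate(date_diff = date_purchase - date_joined) %>%
group_by(ID) %>%
# keep exactly one row per member: the purchase closest to the join date
slice_min(date_diff, with_ties = FALSE) %>%
# <= 365 matches the "within a year" criterion stated in the question
transmute(early_shopper = if_else(!is.na(purchase_type) &
date_diff <= 365, 1, 0)) %>%
ungroup(),
by = "ID"
)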
ID purchase_type date_joined date_purchase early_shopper
1 a grocery 2014-01-01 2014-04-30 1
2 a grocery 2014-01-01 2016-07-08 1
3 b <NA> 2013-04-30 2013-06-29 0
4 c auto 2009-03-08 2015-04-07 0
5 c grocery 2009-03-08 2017-09-10 0
6 d <NA> 2015-03-04 2017-03-10 0

Creating intervals of 5 seconds for data and grouping them

I have a dataframe with one column being time and the other column being price. It looks like this:
Time | Price
15:31:01 | 2
15:31:03 | 4
15:31:05 | 3
15:31:08 | 1
15:31:10 | 9
I would like to group the data into 5-second intervals and take the average of each interval, so that it looks like this:
Time | Price
15:31:05 | 4.5
15:31:10 | 5
I know it is possible to do this with dplyr for buckets of minutes, where I extract the minute for each row and then group by minute. But is there any way to do this for a custom interval in seconds? In this case I am looking at a 5-second interval.
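We can convert the Time column to POSIXct, cut it into intervals of "5 secs", use that as the grouping variable, and take the mean within each group. Note that cut() starts the first interval at the earliest timestamp (15:31:01 here), as the output shows: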
aggregate(Price~ group, transform(df,
group = cut(as.POSIXct(Time, format = "%T"), breaks = "5 secs")), mean)
# group Price
#1 2019-09-17 15:31:01 3
#2 2019-09-17 15:31:06 5
data
df <- structure(list(Time = structure(1:5, .Label = c("15:31:01", "15:31:03",
"15:31:05", "15:31:08", "15:31:10"), class = "factor"), Price = c(2L,
4L, 3L, 1L, 9L)), class = "data.frame", row.names = c(NA, -5L))
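If the buckets should instead be aligned to the clock and labelled by the end of each interval (15:31:05, 15:31:10, as in the question's desired layout), one option is to round each timestamp up to the next multiple of 5 seconds. This is a sketch under that assumption; the names bin and res are illustrative, and the time zone is fixed to UTC only to keep the arithmetic predictable.

```r
df <- data.frame(
  Time  = c("15:31:01", "15:31:03", "15:31:05", "15:31:08", "15:31:10"),
  Price = c(2, 4, 3, 1, 9)
)

# Parse the clock times (today's date is filled in automatically)
t <- as.POSIXct(df$Time, format = "%H:%M:%S", tz = "UTC")

# Round each timestamp up to the next multiple of 5 seconds
bin_end <- as.POSIXct(ceiling(as.numeric(t) / 5) * 5,
                      origin = "1970-01-01", tz = "UTC")
df$bin <- format(bin_end, "%H:%M:%S")

# Mean price per 5-second bucket
res <- aggregate(Price ~ bin, data = df, FUN = mean)
print(res)
#        bin Price
# 1 15:31:05     3
# 2 15:31:10     5
```

With this sample data the first bucket averages 2, 4 and 3, which gives 3 rather than the 4.5 shown in the question, so the exact bucket boundaries the asker had in mind remain unclear.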

define an indicator for overlapping time interval [duplicate]

This question already has an answer here:
finding overlapping time between start time and end time of individuals in a group
(1 answer)
Closed 3 years ago.
I have
household person start time end time
1 1 07:45:00 21:45:00
1 2 09:45:00 17:45:00
1 3 22:45:00 23:45:00
1 4 08:45:00 01:45:00
1 1 23:50:00 24:00:00
2 1 07:45:00 21:45:00
2 2 16:45:00 22:45:00
I want to add a column that flags overlapping times between family members.
The indicator is 1 if a person's start and end times intersect with another household member's, and 0 otherwise.
In the first family of the example above, the times of the first, second, and fourth persons intersect, so their indicator is 1; the third and fifth rows don't intersect with any of the other people in the household.
output:
household person start time end time overlap
1 1 07:45:00 21:45:00 1
1 2 09:45:00 17:45:00 1
1 3 22:45:00 23:45:00 0
1 4 08:45:00 01:45:00 1
1 1 23:50:00 24:00:00 0
2 1 07:45:00 21:45:00 1
2 2 16:45:00 22:45:00 1
data with dput format:
structure(list(SAMPN = c(1L, 1L, 1L, 2L, 2L, 2L), PERNO = c(1,
1, 1, 1, 1, 1), arr = structure(c(30300, 35280, 37200, 32400,
34200, 39600), class = c("hms", "difftime"), units = "secs"),
dep = structure(c(34200, 36300, 61800, 33600, 37800, 50400
), class = c("hms", "difftime"), units = "secs")), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), groups = structure(list(
SAMPN = 1:2, PERNO = c(1, 1), .rows = list(1:3, 4:6)), row.names = c(NA,
-2L), class = c("tbl_df", "tbl", "data.frame"), .drop = TRUE))
I have tried a tidyverse solution:
library(tidyverse)
df = structure(list(SAMPN = c(1L, 1L, 1L, 2L, 2L, 2L),
PERNO = c(1:3, 1:3), arr = structure(c(30300, 35280, 37200, 32400,
34200, 39600), class = c("hms", "difftime"), units = "secs"),
dep = structure(c(34200, 36300, 61800, 33600, 37800, 50400), class = c("hms", "difftime"), units = "secs")), class = c("grouped_df","tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), groups = structure(list(SAMPN = 1:2, PERNO = c(1, 1), .rows = list(1:3, 4:6)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"), .drop = TRUE))
Then, I added:
df %>% group_by(SAMPN) %>%
mutate(
arr_min = mapply(function(x) min(arr[-x]), 1:n()),
dep_max = mapply(function(x) max(dep[-x]), 1:n()),
overlap = ifelse(arr<arr_min | dep>dep_max, 0, 1)
)
You will get:
SAMPN PERNO arr dep arr_min dep_max overlap
<int> <int> <time> <time> <dbl> <dbl> <dbl>
1 1 1 08:25 09:30 35280 61800 0
2 1 2 09:48 10:05 30300 61800 1
3 1 3 10:20 17:10 30300 36300 0
4 2 1 09:00 09:20 34200 50400 0
5 2 2 09:30 10:30 32400 50400 1
6 2 3 11:00 14:00 32400 37800 0
You basically compare the current arr and dep with arr_min (min(arr) value excluding the current case) and dep_max (max(dep) excluding the current case).
Tidyverse solution
Here's a solution in tidyverse syntax. The basic idea is the same. We perform a many to many merge matching on household (sampn in your current example data) and remove the cases of comparing a person to themself (perno). We check for overlaps, then collapse to a single record per household and person. Note that this code will error if all records have the same perno.
compare <-
df %>%
left_join(df %>%
rename(compare_PERNO = PERNO,
compare_arr = arr,
compare_dep = dep), by = ("SAMPN")) %>%
filter(PERNO != compare_PERNO) %>%
mutate(overlap = arr <= compare_dep & dep >= compare_arr) %>%
group_by(SAMPN, PERNO) %>%
summarize(overlap = max(overlap))
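The same pairwise test (start <= other end and end >= other start) can also be sketched in base R without a join. The example below uses a few rows from the question's table, converted by hand to minutes since midnight and leaving out the rows that cross midnight; the data frame hh and the helper has_overlap are hypothetical names for illustration only.

```r
# Illustrative data: times in minutes since midnight
hh <- data.frame(
  household = c(1, 1, 1, 2, 2),
  person    = c(1, 2, 3, 1, 2),
  start     = c(465, 585, 1365, 465, 1005),   # 07:45, 09:45, 22:45, 07:45, 16:45
  end       = c(1305, 1065, 1425, 1305, 1365) # 21:45, 17:45, 23:45, 21:45, 22:45
)

# has_overlap (hypothetical helper): for each interval, does it intersect
# any OTHER interval in the same vector?
has_overlap <- function(start, end) {
  as.integer(vapply(seq_along(start), function(i) {
    any(start[i] <= end[-i] & end[i] >= start[-i])
  }, logical(1)))
}

# Apply the helper within each household, keeping the original row order
hh$overlap <- ave(seq_len(nrow(hh)), hh$household,
                  FUN = function(idx) has_overlap(hh$start[idx], hh$end[idx]))
print(hh$overlap)  # 1 1 0 1 1
```

A household with a single row simply gets 0, because there is no other interval to compare against. Intervals that run past midnight would need to be normalised first, which this sketch does not attempt.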
SQL Solution with household grouping
Grouping the data by household actually makes this problem slightly easier. Again, I'm using SQL to solve it. In the inner SQL statement I do a many to many merge matching all members of a household to all other members, I remove any cases of matching a person to themself. Then, in the outer SQL statement we reduce to one record per household and person, which indicates if they ever overlapped.
df = data.frame(
household = c(rep(1,5), rep(2,2)),
person = c(1:5, 1:2),
start_time=as.Date(c("2017-05-31","2018-01-14", "2019-02-03", "2018-01-19", "2019-04-17",
"2018-02-03", "2018-03-03"),
format="%Y-%m-%d"),
end_time=as.Date(c("2018-01-17", "2018-01-20", "2019-04-15", "2018-02-20", "2019-05-17",
"2019-03-03", "2019-03-03"),
format="%Y-%m-%d"))
library(sqldf)
compare <- sqldf(
"
SELECT * FROM (
SELECT L.* ,
CASE WHEN L.start_time <= R.end_time AND L.end_time >= R.start_time THEN 1
ELSE 0 END AS overlap
FROM df as L
LEFT JOIN df as R ON L.household = R.household
WHERE L.person != R.person
)
GROUP BY household, person
HAVING overlap = MAX(overlap)
"
)
SQL Solution without household grouping
This is an SQL solution to your problem. I do a keyless many-to-many merge to compare each row to every other row (but don't compare a row to itself), then I pare the big data frame down to a single record per ID that records whether any matches were found. Your data isn't quite a reprex (use the dput function in R), so I used an example dataset I had lying around. If you have trouble adapting this to your exact data, post reproducible data and I can help you out.
df = data.frame(
id = 1:3,
start_time=as.Date(c("2017-05-31","2018-01-14", "2018-02-03"), format="%Y-%m-%d"),
end_time=as.Date(c("2018-01-17", "2018-01-20", "2018-04-17"), format="%Y-%m-%d"))
library(sqldf)
compare <- sqldf(
"
SELECT * FROM (
SELECT L.* ,
CASE WHEN L.start_time <= R.end_time AND L.end_time >= R.start_time THEN 1
ELSE 0 END AS overlap
FROM df as L
CROSS JOIN df as R
WHERE L.id != R.id
)
GROUP BY ID
HAVING overlap = MAX(overlap)
"
)

How to divide contents of one column by different values, conditional on contents of a second column?

I've got a data frame that looks like something along these lines:
Day Salesperson Value
==== ============ =====
Monday John 40
Monday Sarah 50
Tuesday John 60
Tuesday Sarah 30
Wednesday John 50
Wednesday Sarah 40
I want to divide the value for each salesperson by the number of times that each of the days of the week has occurred. So: there have been 3 Mondays, 3 Tuesdays, and 2 Wednesdays. I don't have this information digitally, but can create a vector along the lines of
c(3, 3, 2)
How can I conditionally divide the Value column based on the number of times each day occurs?
I've found an inelegant solution, which entails copying the Day column to a temp column, replacing each of the names of the week in the new column with the number of times each day occurs using
df$temp <- sub("Monday", 3, df$temp)
but doing this seems kinda clunky. Is there a neat way to do this?
Suppose your auxiliary data is in another data.frame:
Day N_Day
1 Monday 3
2 Tuesday 3
3 Wednesday 2
The simplest way would be to merge:
DF_new <- merge(DF, DF2, by="Day")
DF_new$newcol <- DF_new$Value / DF_new$N_Day
which gives
Day Salesperson Value N_Day newcol
1 Monday John 40 3 13.33333
2 Monday Sarah 50 3 16.66667
3 Tuesday John 60 3 20.00000
4 Tuesday Sarah 30 3 10.00000
5 Wednesday John 50 2 25.00000
6 Wednesday Sarah 40 2 20.00000
The mergeless shortcut is
DF$newcol <- DF$Value / DF2$N_Day[match(DF$Day, DF2$Day)]
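Since the question already has the counts as a plain vector, a named vector can serve as the lookup without building a second data frame. This is a small sketch of that idea; the name n_day is made up for illustration, and the data is rebuilt here with character columns so the block runs on its own.

```r
DF <- data.frame(
  Day = c("Monday", "Monday", "Tuesday", "Tuesday", "Wednesday", "Wednesday"),
  Salesperson = c("John", "Sarah", "John", "Sarah", "John", "Sarah"),
  Value = c(40, 50, 60, 30, 50, 40)
)

# Named lookup vector (illustrative name): day -> number of occurrences
n_day <- c(Monday = 3, Tuesday = 3, Wednesday = 2)

# Index by name; as.character() guards against Day being a factor,
# in which case indexing would use the integer codes instead of the labels
DF$newcol <- DF$Value / unname(n_day[as.character(DF$Day)])
print(DF$newcol)
```

This gives the same newcol values as the merge and match() versions, with the counts kept next to the day names they belong to.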
Data:
DF <- structure(list(Day = structure(c(1L, 1L, 2L, 2L, 3L, 3L), .Label =
c("Monday",
"Tuesday", "Wednesday"), class = "factor"), Salesperson = structure(c(1L,
2L, 1L, 2L, 1L, 2L), .Label = c("John", "Sarah"), class = "factor"),
Value = c(40L, 50L, 60L, 30L, 50L, 40L)), .Names = c("Day",
"Salesperson", "Value"), class = "data.frame", row.names = c(NA,
-6L))
DF2 <- structure(list(Day = structure(1:3, .Label = c("Monday", "Tuesday",
"Wednesday"), class = "factor"), N_Day = c(3, 3, 2)), .Names = c("Day",
"N_Day"), row.names = c(NA, -3L), class = "data.frame")
You can use the dplyr library to join your data frame with the frequency of each day.
library(dplyr)
df <- data.frame(
Day=c("Monday","Monday","Tuesday","Tuesday","Wednesday","Wednesday"),
Salesperson=c("John","Sarah","John","Sarah","John","Sarah"),
Value=c(40,50,60,30,50,40), stringsAsFactors=FALSE)
aux <- data.frame(
Day=c("Monday","Tuesday","Wednesday"),
n=c(3,3,2), stringsAsFactors=FALSE
)
output <- df %>% left_join(aux, by="Day") %>% mutate(Value2=Value/n)
To build this auxiliary table from the count of days that appear in your original data instead of creating it manually, you could use the following and then re-run the join above (the output shown below uses these counts):
aux <- df %>% group_by(Day) %>% summarise(n=n())
> output
Day Salesperson Value n Value2
1 Monday John 40 2 20
2 Monday Sarah 50 2 25
3 Tuesday John 60 2 30
4 Tuesday Sarah 30 2 15
5 Wednesday John 50 2 25
6 Wednesday Sarah 40 2 20
If you want to overwrite the actual Value column, use mutate(Value=Value/n), and to remove the additional column you can add select(-n):
output <- df %>% left_join(aux, by="Day") %>% mutate(Value=Value/n) %>% select(-n)

Convert factor type time into number of minutes

There is a column in my dataset that contains time in the format 00:20:10. I have two questions. First, when I import it into R using read.xlsx2(), this column is converted to factor type. How can I convert it to time type?
Second, I want to calculate each person's total time in number of minutes.
ID Time
1 00:10:00
1 00:21:30
2 00:30:10
2 00:04:10
The output I want is:
ID Total.time
1 31.5
2 34.3
I haven't dealt with time issue before, and I hope someone would recommend some packages as well.
You could use times() from the chron package to convert the Time column to "times" class. Then aggregate() to sum the times, grouped by the ID column. This first block will give us actual times in the result.
library(chron)
df$Time <- times(df$Time)
aggregate(list(Total.Time = df$Time), df[1], sum)
# ID Total.Time
# 1 1 00:31:30
# 2 2 00:34:20
For decimal output, we can employ minutes() and seconds(), also from chron.
aggregate(list(Total.Time = df$Time), df[1], function(x) {
minutes(s <- sum(x)) + (seconds(s) / 60)
})
# ID Total.Time
# 1 1 31.50000
# 2 2 34.33333
Furthermore, we can also use data.table for improved efficiency.
library(data.table)
setDT(df)[, .(Total.Time = minutes(s <- sum(Time)) + (seconds(s) / 60)), by = ID]
# ID Total.Time
# 1: 1 31.50000
# 2: 2 34.33333
Data:
df <- structure(list(ID = c(1L, 1L, 2L, 2L), Time = structure(c(2L,
3L, 4L, 1L), .Label = c("00:04:10", "00:10:00", "00:21:30", "00:30:10"
), class = "factor")), .Names = c("ID", "Time"), class = "data.frame", row.names = c(NA,
-4L))
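If installing chron is not an option, the conversion can also be sketched in base R by splitting the "HH:MM:SS" strings. As far as I can tell, minutes() returns only the minute field, so the chron expression above would ignore any full hours in a total; the version below includes hours explicitly. The helper name to_minutes is hypothetical.

```r
df <- data.frame(
  ID   = c(1L, 1L, 2L, 2L),
  Time = factor(c("00:10:00", "00:21:30", "00:30:10", "00:04:10"))
)

# to_minutes (hypothetical helper): "HH:MM:SS" (factor or character) -> minutes
to_minutes <- function(x) {
  parts <- strsplit(as.character(x), ":", fixed = TRUE)
  vapply(parts, function(p) {
    p <- as.numeric(p)
    p[1] * 60 + p[2] + p[3] / 60
  }, numeric(1))
}

# Total minutes per ID
res <- aggregate(list(Total.Time = to_minutes(df$Time)),
                 list(ID = df$ID), sum)
print(res)
```

For the sample data this yields 31.5 minutes for ID 1 and about 34.33 minutes for ID 2, matching the chron results.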
