I am working with a call log data set from a telephone hotline service. There are three call outcomes: Answered, Abandoned & Engaged. I am trying to find out the average time taken by each caller to contact the hotline again if they abandoned the previous call. The time difference can be either seconds, minutes, hours or days but I would like to get all four if possible.
Here is some mock data with the variables I am working with:-
library(wakefield)#for generating the Status variable
library(dplyr)
library(stringi)
library(Pareto)
library(uuid)
n_users<-1300
n_rows <- 365000
set.seed(1)
#data<-data.frame()
Date<-seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by = "1 day")
Date<-sample(rep(Date,each=1000),replace = T)
u <- runif(length(Date), 0, 60*60*12) # "noise" to add or subtract from some timepoint
CallDateTime<-as.POSIXlt(u, origin = paste0(Date,"00:00:00"))
CallDateTime
CallOutcome<-r_sample_factor(x = c("Answered", "Abandoned", "Engaged"), n=length(Date))
CallOutcome
data<-data.frame(Date,CallDateTime,CallOutcome)
relative_probs <- rPareto(n = n_users, t = 1, alpha = 0.3, truncation = 500)
unique_ids <- UUIDgenerate(n = n_users)
data$CallerId <- sample(unique_ids, size = n_rows, prob = relative_probs, replace = TRUE)
data<-data%>%arrange(CallDateTime)
head(data)
So to reiterate, if a caller abandons their call (represented by "Abandoned" in the CallOutcome column), I would like to know the average time taken for the caller to make another call to the service, in the four time units I have mentioned. Any pointers on how I can achieve this would be great :)
Keep rows in the data where the current row is "Abandoned" and the next row is not "Abandoned" for each ID. Find difference in time between every 2 rows to get time required for the caller to make another call to service after it was abandoned, take average of each of the duration to get average time.
library(dplyr)
data %>%
#Test the answer on smaller subset
#slice(1:1000) %>%
arrange(CallerId, CallDateTime) %>%
group_by(CallerId) %>%
filter(CallOutcome == 'Abandoned' & dplyr::lead(CallOutcome) != 'Abandoned' |
CallOutcome != 'Abandoned' & dplyr::lag(CallOutcome) == 'Abandoned') %>%
mutate(group = rep(row_number(), each = 2, length.out = n())) %>%
group_by(group, .add = TRUE) %>%
summarise(avg_sec = difftime(CallDateTime[2], CallDateTime[1], units = 'secs')) %>%
mutate(avg_sec = as.numeric(mean(avg_sec)),
avg_min = avg_sec/60,
avg_hour = avg_min/60,
avg_day = avg_hour/24) -> result
result
First, I would create the lead variable (basically calculate what is the "next" value by group. Then it's just as easy as using whatever unit you want for difftime. A density plot can help you analyze these differences, as shown below.
data <-
data %>%
group_by(CallerId) %>%
mutate(CallDateTime_Next = lead(CallDateTime)) %>%
ungroup() %>%
mutate(
diff_days = difftime(CallDateTime_Next, CallDateTime, units = 'days'),
diff_hours = difftime(CallDateTime_Next, CallDateTime, units = 'hours'),
diff_mins = difftime(CallDateTime_Next, CallDateTime, units = 'mins'),
diff_secs = difftime(CallDateTime_Next, CallDateTime, units = 'secs')
)
data %>%
filter(CallOutcome == 'Abandoned') %>%
ggplot() +
geom_density(aes(x = diff_days))
Related
I'm looking for some help understanding piping and summarizing functions using dplyr. I feel like my coding is a bit verbose and could be simplified. So there is a couple of questions in here because I know I'm missing some concepts, but I'm not quite sure where that lack of knowledge is. I've included my full code at the bottom. Thanks in advance as this is a bit larger ask.
1a. From the example data below and using dplyr, is there a way to calculate the games(dates) per team without using an intermediate table?
1b. I've included my original way to calculate n_games which didn't work. Why?
set.seed(123)
shot_df_ex <- tibble(Team_Name = sample(LETTERS[1:5],250, replace = TRUE),
Date = sample(as.Date(c("2019-08-01",
"2019-09-01",
"2018-08-01",
"2018-09-01",
"2017-08-01",
"2017-09-01")),
size = 250, replace = TRUE),
Type = sample(c("shot","goal"), size = 250,
replace = TRUE, prob = c(0.9,0.1))
)
# count shots per team per game(date)
n_shots_per_game <- shot_df_ex %>%
count(Team_Name,Date)
n_shots_per_game
# count games (dates) per team [ISSUES!!!]
# is there a way to do this piping from the shot_df_ex tibble instead of
# using an intermediate tibble?
# count number of games using the tibble created above [DOES NOT WORK--WHY?]
n_games <- n_shots_per_game %>%
count(Team_Name)
n_games #what is this counting? It should be 6 for each.
# this works, but isn't count() just a quicker way to run
# group_by() %>% summarise()?
n_games <- n_shots_per_game %>%
group_by(Team_Name) %>%
summarise(N_Games=n())
n_games
Below is my process of creating a summary table. I understand that piping is meant to cut out the creation of some intermediate variables/tables. Where could I combine steps below to create the final table with a minimum number of intermediate steps.
# load librarys ------------------------------------------------
library(tidyverse)
# build sample shot data ---------------------------------------
set.seed(123)
shot_df_ex <- tibble(Team_Name = sample(LETTERS[1:5],250, replace = TRUE),
Date = sample(as.Date(c("2019-08-01",
"2019-09-01",
"2018-08-01",
"2018-09-01",
"2017-08-01",
"2017-09-01")),
size = 250, replace = TRUE),
Type = sample(c("shot","goal"), size = 250,
replace = TRUE, prob = c(0.9,0.1))
)
# calculate data ----------------------------------------------
# since every row is a shot, the following function counts shots for ea. team
n_shots <- shot_df_ex %>%
count(Team_Name) %>%
rename(N_Shots = n)
n_shots
# do the same for goals for each team
n_goals <- shot_df_ex %>%
filter(Type == "goal") %>%
count(Team_Name,sort = T) %>%
rename(N_Goals = n) %>%
arrange(Team_Name)
n_goals
# count shots per team per game(date)
n_shots_per_game <- shot_df_ex %>%
count(Team_Name,Date)
n_shots_per_game
# count games (dates) per team [ISSUES!!!]
# is there a way to do this piping from the shot_df_ex tibble instead of
# using an intermediate tibble?
# count number of games using the tibble created above [DOES NOT WORK]
n_games <- n_shots_per_game %>%
count(Team_Name)
n_games #what is this counting? It should be 6 for each.
# this works, but isn't count() just a quicker way to run
# group_by() %>% summarise()?
n_games <- n_shots_per_game %>%
group_by(Team_Name) %>%
summarise(N_Games=n())
n_games
# combine data ------------------------------------------------
# combine columns and add average shots per game
shot_table_ex <- n_games %>%
left_join(n_shots) %>%
left_join(n_goals)
# final table with final average calculations
shot_table_ex <- shot_table_ex %>%
mutate(Shots_per_Game = round(N_Shots / N_Games, 1),
Goals_per_Game = round(N_Goals / N_Games, 1)) %>%
arrange(Team_Name)
shot_table_ex
For 1a, you can just pipe straight from the tibble() function to count(). ie.
tibble(Team_Name = sample(LETTERS[1:5],250, replace = TRUE),
Date = sample(as.Date(c("2019-08-01",
"2019-09-01",
"2018-08-01",
"2018-09-01",
"2017-08-01",
"2017-09-01")),
size = 250, replace = TRUE),
Type = sample(c("shot","goal"), size = 250,
replace = TRUE, prob = c(0.9,0.1))) %>%
count(Team_Name,Date)
In 1b, count() is using your column n (ie. the number of shots) as a weighting variable so is summing the total number of shots per team, not the number of rows. It prints a message telling you this:
Using `n` as weighting variable i Quiet this message with `wt = n` or count rows with `wt = 1`
Using count(Team_Name, wt=n()) will give the behaviour you want.
Edit: part 2
shot_table_ex <- tibble(Team_Name = sample(LETTERS[1:5],250, replace = TRUE),
Date = sample(as.Date(c("2019-08-01",
"2019-09-01",
"2018-08-01",
"2018-09-01",
"2017-08-01",
"2017-09-01")),
size = 250, replace = TRUE),
Type = sample(c("shot","goal"), size = 250,
replace = TRUE, prob = c(0.9,0.1))) %>%
group_by(Team_Name) %>%
summarise(n_shots = n(),
n_goals = sum(Type == "goal"),
n_games = n_distinct(Date)) %>%
mutate(Shots_per_Game = round(n_shots / n_games, 1),
Goals_per_Game = round(n_goals / n_games, 1))
1a. From the example data below and using dplyr, is there a way to calculate the games(dates) per team without using an intermediate table?
This is how I would do it:
shot_df_ex %>%
distinct(Team_Name, Date) %>% #Keeps only the cols given and one of each combo
count(Team_Name)
You can also use unique:
shot_df_ex %>%
group_by(Team_Name) %>%
summarize(N_Games = length(unique(Date))
1b. I've included my original way to calculate n_games which didn't
work. Why?
Your code is working for me. Did you perhaps save over the intermediate table? It's counting the expected 6 per team.
Below is my process of creating a summary table. I understand that piping is meant to cut out the creation of some intermediate
variables/tables. Where could I combine steps below to create the
final table with a minimum number of intermediate steps?
shot_df_ex %>%
group_by(Team_Name) %>%
summarize(
N_Games = length(unique(Date)),
N_Shots = sum(Type == "shot"),
N_Goals = sum(Type == "goal")
) %>%
mutate(Shots_per_Game = round(N_Shots / N_Games, 1),
Goals_per_Game = round(N_Goals / N_Games, 1))
You can use multiple summarize steps at a time as long as you don't need to change your grouping. We're taking advantage here (in the sum calls) of the interpretation of True as 1 and False as 0. length will of course give us the length of the vector produced by unique.
this (count) works, but isn't count() just a quicker way to run group_by() %>% summarise()?
count is just a combination of group_by(col) %>% tally() and tally is essentially summarize(x=n()) so yes. :)
I have a "trips" dataset that includes a unique trip id, and a start and end time (the specific hour and minute) of the trips. These trips were all taken on the same day. I am trying to determine the number of cars on the road at any given time and plot it as a line graph using ggplot in R. In other words, a car is "on the road" at any time in between its start and end time.
The most similar example I can find uses the following structure:
yearly_counts <- trips %>%
count(year, trip_id)
ggplot(data = yearly_counts, mapping = aes(x = year, y = n)) +
geom_line()
Would the best approach be to modify this structure have an "minutesByHour_count" variable that has a count for every minute of every hour? This seems inefficient to me, and still doesn't solve the problem of getting the counts from the start/end time.
Is there any easier way to do this?
Here's an example based on counting each start as an additional car, and each end as a reduction in the count:
library(tidyverse)
df %>%
gather(type, time, c(start_hour, end_hour)) %>%
mutate(count_chg = if_else(type == "start_hour", 1, -1)) %>%
arrange(time) %>%
mutate(car_count = cumsum(count_chg)) %>%
ggplot(aes(time, car_count)) +
geom_step()
Sample data:
df <- data.frame(
uniqueID = 1:60,
start_hour = seq(8, 12, length.out = 60),
dur_hour = 0.05*1:60
)
df$end_hour = df$start_hour + df$dur_hour
df$dur_hour = NULL
I've created a few custom tweaks to zoo::na.locf before, but this one is driving me nuts. I need a function that will carry forward the last observation of a column only if the values in another column haven't changed; and it all has to be grouped by a primary key. For example:
library(dplyr)
set.seed(20180409)
data <- data.frame(Id = rep(1:10, each = 24),
Date = rep(seq.Date(as.Date("2016-01-01"), as.Date("2017-12-01"),
by = "month"), 10),
FillCol = replace(runif(240), runif(240) < 0.9, NA),
CheckCol = rep(letters[1:7], each = 7, length.out = 240))
data <- data %>%
group_by(Id) %>%
mutate(CheckColHasChanged = replace(lag(CheckCol) != CheckCol,
is.na(lag(CheckCol) != CheckCol), TRUE),
FillColIsNA = is.na(FillCol))
So I'm trying to carry foward any observations of FillCol, but once we hit an observation where CheckColHasChanged, stop the carry forward until the next valid observation in FillCol. I can do it in a loop but I'm struggling to do it properly.
Fill <- TRUE #indicator for whether or not I should be carrying forward
for(row in 2:nrow(data)){
#if the CheckCol has changed, don't fill
if(data$CheckColHasChanged[row]){Fill <- FALSE}
#if we should fill and still have the same Id, then fill from the last obs
if(Fill & data$Id[row] == data$Id[row - 1]){
data$FillCol[row] <- data$FillCol[row - 1]
}else{ #if there's a valid obs in FillCol, set the indicator back to true
if(!data$FillColIsNA[row]){Fill <- TRUE}
}
}
Any help would be greatly appreciated!
Comment to answer: this is just filling in by both Id and CheckCol:
data %>% group_by(Id, CheckCol) %>%
mutate(result = zoo::na.locf(FillCol, na.rm = FALSE))
The way you describe CheckCol, it is treated just like an ID. There's no difference between "only if the values in another column haven't changed" and "grouped by a primary key". You just have two columns to group by.
I have working code below, which does what I am after, and does it fine for a test subset of +- 1.000 records. However, in the actual dataset, I have about half a million rows, where suddenly the code takes up over five minutes. Could anyone tell me why or how to improve the code?
The end result I need is to keep only the first value of duplicated ID's, but for each year this should be renewed (i.e. double values are fine if they are in different years, but not in the same year).
Test %>%
group_by(year, id) %>%
mutate(is_duplicate = duplicated(id)) %>%
mutate(oppervlakt = ifelse(is_duplicate == FALSE, oppervlakt, 0))%>%
select(-is_duplicate)
I think you could remove id from grouping and should get same results. See this example:
library(dplyr)
# some sample data:
n_rows <- 1E6
df <- data.frame(year = sample(x = c(2000:2018), size = n_rows, replace = TRUE),
id = sample(x = seq_len(1000), size = n_rows, replace = TRUE),
oppervlakt = rnorm(n = n_rows))
# Roughly 1 second:
system.time(df_slow <- df %>% group_by(year, id) %>% mutate(oppervlakt = ifelse(duplicated(id), 0, oppervlakt)))
# Roughly .1 second:
system.time(df_fast <- df %>% group_by(year) %>% mutate(oppervlakt = ifelse(duplicated(id), 0, oppervlakt)))
all.equal(df_slow, df_fast)
[1] TRUE
I've measured N20 flux from soil at multiple timepoints in the day (not equally spaced). I'm trying to calculate the total N20 flux from soil for a subset of days by finding the area under the curve for the given day. I know how to do this when using only measures from the given day, however, I'd like to include the last measure of the previous day and the first measure of the following day to improve the estimation of the curve.
Here's an example to give a more concrete idea:
library(MESS)
library(lubridate)
library(dplyr)
Generate Reproducible Example
datetime <- seq(ymd_hm('2015-04-07 11:20'),ymd('2015-04-13'), by = 'hours')
dat <- data.frame(datetime, day = day(datetime), Flux = rnorm(n = length(datetime), mean = 400, sd = 20))
useDate <- data.frame(day = c(7:12), DateGood = c("No", "Yes", "Yes", "No", "Yes", "No"))
dat <- left_join(dat, useDate)
Some days are "bad" (too many missing measures) and some are "Good" (usable). The goal is to filter all measurements (rows) that occurred on a "Good" day as well as the last measurement from the day before and the first measurement on the next day.
out <- dat %>%
mutate(lagDateGood = lag(DateGood),
leadDateGood = lead(DateGood)) %>%
filter(lagDateGood != "No" | leadDateGood != "No")
Now I need to calculate the area under the curve - this is not correct
out2 <- out %>%
group_by(day) %>%
mutate(hourOfday = hour(datetime) + minute(datetime)/60) %>%
summarize(auc = auc(x = hourOfday, y = Flux, from = 0, to = 24, type = "spline"))
The trouble is that I don't include the measurements on end of previous day and start of following day when calculating AUC. Also, I get an estimate of flux for day 10, which is a "bad" day.
I think the crux of my question has to do with groups. Some measurements need to be in multiple groups (for example the last measurement on day 8 would be used in estimating AUC for day 8 and day 9). Do you have suggestions for how I could form new groups? Or might there be a completely different way to achieve the goal?
For what it's worth, this is what I did. The answer really lies in the question I linked to in the comments. Starting with the dataframe "out" from the question:
#Now I need to calculate the area under the curve for each day
n <- nrow(out)
extract <- function(ix) out[seq(max(1, min(ix)-1), min(n, max(ix) + 1)), ]
res <- lapply(split(1:n, out$day), extract)
calcTotalFlux <- function(df) {
if (nrow(df) < 10) { # make sure the day has at least 10 measures
NA
} else {
day_midnight <- floor_date(df$datetime[2], "day")
df %>%
mutate(time = datetime - day_midnight) %>%
summarize(TotalFlux = auc(x = time, y = Flux, from = 0, to = 1440, type = "spline"))}
}
do.call("rbind",lapply(res, calcTotalFlux))
TotalFlux
7 NA
8 585230.2
9 579017.3
10 NA
11 563689.7
12 NA
Here's another way. More in line with the suggestions of #Alex Brown.
# Another way
last <- out %>%
group_by(day) %>%
filter(datetime == max(datetime)) %>%
ungroup() %>%
mutate(day = day + 1)
first <- out %>%
group_by(day) %>%
filter(datetime == min(datetime)) %>%
ungroup() %>%
mutate(day = day - 1)
d <- rbind(out, last, first) %>%
group_by(day) %>%
arrange(datetime)
n_measures_per_day <- d %>%
summarize(n = n())
d <- left_join(d, n_measures_per_day) %>%
filter(n > 4)
TotalFluxDF <- d %>%
mutate(timeAtMidnight = floor_date(datetime[3], "day"),
time = datetime - timeAtMidnight) %>%
summarize(auc = auc(x = time, y = Flux, from = 0, to = 1440, type = "spline"))
TotalFluxDF
Source: local data frame [3 x 2]
day auc
(dbl) (dbl)
1 8 585230.2
2 9 579017.3
3 11 563689.7