I am working with a call log data set from a telephone hotline service. There are three call outcomes: Answered, Abandoned and Engaged. I am trying to find the average time it takes a caller to contact the hotline again after abandoning a call. The time difference could be expressed in seconds, minutes, hours or days, and I would like all four if possible.
Here is some mock data with the variables I am working with:
library(wakefield)#for generating the Status variable
library(dplyr)
library(stringi)
library(Pareto)
library(uuid)
n_users<-1300
n_rows <- 365000
set.seed(1)
#data<-data.frame()
Date<-seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by = "1 day")
Date<-sample(rep(Date,each=1000),replace = T)
u <- runif(length(Date), 0, 60*60*12) # "noise" to add or subtract from some timepoint
CallDateTime<-as.POSIXlt(u, origin = paste(Date, "00:00:00")) # paste() puts a space between date and time
CallDateTime
CallOutcome<-r_sample_factor(x = c("Answered", "Abandoned", "Engaged"), n=length(Date))
CallOutcome
data<-data.frame(Date,CallDateTime,CallOutcome)
relative_probs <- rPareto(n = n_users, t = 1, alpha = 0.3, truncation = 500)
unique_ids <- UUIDgenerate(n = n_users)
data$CallerId <- sample(unique_ids, size = n_rows, prob = relative_probs, replace = TRUE)
data<-data%>%arrange(CallDateTime)
head(data)
So to reiterate, if a caller abandons their call (represented by "Abandoned" in the CallOutcome column), I would like to know the average time taken for the caller to make another call to the service, in the four time units I have mentioned. Any pointers on how I can achieve this would be great :)
For each caller, keep the rows where the current call is "Abandoned" and the next call is not, together with the call that immediately follows each of them. Pair these rows up and take the time difference within each pair to get how long the caller took to call the service again after abandoning. Averaging those differences gives the average time in each unit.
library(dplyr)
data %>%
#Test the answer on smaller subset
#slice(1:1000) %>%
arrange(CallerId, CallDateTime) %>%
group_by(CallerId) %>%
filter(CallOutcome == 'Abandoned' & dplyr::lead(CallOutcome) != 'Abandoned' |
CallOutcome != 'Abandoned' & dplyr::lag(CallOutcome) == 'Abandoned') %>%
mutate(group = rep(row_number(), each = 2, length.out = n())) %>%
group_by(group, .add = TRUE) %>%
summarise(avg_sec = difftime(CallDateTime[2], CallDateTime[1], units = 'secs')) %>%
mutate(avg_sec = as.numeric(mean(avg_sec)),
avg_min = avg_sec/60,
avg_hour = avg_min/60,
avg_day = avg_hour/24) -> result
result
First, I would create a lead variable (basically, calculate the "next" call time within each caller's group). Then it's just a matter of choosing whichever unit you want in difftime. A density plot can help you analyze these differences, as shown below.
data <-
data %>%
group_by(CallerId) %>%
mutate(CallDateTime_Next = lead(CallDateTime)) %>%
ungroup() %>%
mutate(
diff_days = difftime(CallDateTime_Next, CallDateTime, units = 'days'),
diff_hours = difftime(CallDateTime_Next, CallDateTime, units = 'hours'),
diff_mins = difftime(CallDateTime_Next, CallDateTime, units = 'mins'),
diff_secs = difftime(CallDateTime_Next, CallDateTime, units = 'secs')
)
library(ggplot2) # needed for ggplot() and geom_density()
data %>%
  filter(CallOutcome == 'Abandoned') %>%
  ggplot() +
  geom_density(aes(x = as.numeric(diff_days)))
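To get from the per-call differences to the averages asked for in the question, one option is to summarise the abandoned calls in each unit. Below is a minimal sketch on a made-up five-call log. The `calls` data frame and its values are illustrative assumptions, not taken from the mock data above.

```r
library(dplyr)

# Hypothetical toy log: two callers, times in UTC
calls <- data.frame(
  CallerId = c("a", "a", "a", "b", "b"),
  CallDateTime = as.POSIXct(
    c("2015-01-01 09:00:00", "2015-01-01 09:30:00", "2015-01-02 09:30:00",
      "2015-01-01 10:00:00", "2015-01-01 12:00:00"),
    tz = "UTC"
  ),
  CallOutcome = c("Abandoned", "Abandoned", "Answered", "Abandoned", "Engaged"),
  stringsAsFactors = FALSE
)

avg_wait <- calls %>%
  arrange(CallerId, CallDateTime) %>%
  group_by(CallerId) %>%
  # seconds until the same caller's next call (NA for their last call)
  mutate(secs_to_next = as.numeric(
    difftime(lead(CallDateTime), CallDateTime, units = "secs")
  )) %>%
  ungroup() %>%
  # keep abandoned calls that were followed by another call
  filter(CallOutcome == "Abandoned", !is.na(secs_to_next)) %>%
  summarise(
    avg_sec  = mean(secs_to_next),
    avg_min  = avg_sec / 60,
    avg_hour = avg_sec / 3600,
    avg_day  = avg_sec / 86400
  )

avg_wait
```

This sketch counts every abandoned call that is followed by any later call, including another abandoned one. Add `group_by(CallerId)` before `summarise()` if a per-caller average is wanted instead of one overall figure.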
I have some time series data to which I apply the rolling_origin function, which produces a list of time series splits. The time series runs from 2020-03-01 to 2020-10-30.
I want to start from 2020-04-15 such that I have 1 month before (2020-03-15) and 1 month after (2020-05-15). I can use an ifelse statement to add a 1 for observations after and a 0 for observations before.
rolledData %>%
map(., ~mutate(.x,
treatment_control = ifelse(date >= as.Date("2020-04-15"), 1, 0)
))
But what I want is to increment the ifelse date as I map over the list. So the first one might start on 2020-04-15, the next list in the sequence would use 2020-04-16, the next 2020-04-17, and so on until the end.
I could manually write out the results:
lst1 <- rolledData[[12]] %>%
mutate(
treatment_control = ifelse(date >= as.Date("2020-04-15"), 1, 0)
)
lst2 <- rolledData[[13]] %>%
mutate(
treatment_control = ifelse(date >= as.Date("2020-04-16"), 1, 0)
)
lst3 <- rolledData[[14]] %>%
mutate(
treatment_control = ifelse(date >= as.Date("2020-04-17"), 1, 0)
)
How can I map over the list and increment the treatment_control mutate?
Note: Because I am using financial data (which was just the easiest to obtain for a reproducible example) the weekends are removed (in my data I have a full week)
Data:
library(tidyquant)
library(rsample)
data <- tq_get(c("AAPL"),
get = "stock.prices",
from = "2020-03-01",
to = "2020-10-30")
rolledData <- data %>%
rolling_origin(
data = .,
initial = 60, # 2 months of data
assess = 0,
cumulative = FALSE,
skip = 0
)
rolledData <- rolledData$splits %>%
map(., ~analysis(.x))
If the dates are different, we can pass map2 a vector of custom dates that has the same length as rolledData:
library(dplyr)
library(purrr)
rolledData2 <- rolledData %>%
map2(., newdates,
~ .x %>%
mutate(treatment_control = +(date >= .y)
))
where
newdates <- seq(as.Date("2020-03-15"), length.out = length(rolledData), by = "1 day")
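As a small self-contained illustration of the map2 idea, the sketch below pairs each list element with its own cutoff date. `toy_list` and `cutoffs` are hypothetical stand-ins for rolledData and newdates, chosen only so the result is easy to check by eye.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for rolledData: three 5-day windows, each shifted by a day
toy_list <- list(
  data.frame(date = as.Date("2020-04-13") + 0:4),
  data.frame(date = as.Date("2020-04-14") + 0:4),
  data.frame(date = as.Date("2020-04-15") + 0:4)
)

# One cutoff per list element, advancing by one day each time
cutoffs <- seq(as.Date("2020-04-15"), by = "1 day", length.out = length(toy_list))

# .x is the data frame, .y is the cutoff that belongs to it
flagged <- map2(toy_list, cutoffs, ~ mutate(.x, treatment_control = +(date >= .y)))

# Number of rows flagged 1 in each element
n_flagged <- map_int(flagged, ~ sum(.x$treatment_control))
n_flagged
```

Because both the windows and the cutoffs move forward one day at a time, each element ends up with the same number of flagged rows (three here).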
If the cutoff is one month after the first 'date' value in each element:
library(lubridate)
rolledData2 <- rolledData %>%
map(~ .x %>%
mutate(treatment_control =
+(date >= (first(date) %m+% months(1)))))
As suggested by the title, I would like to extract values from other rows.
In particular, as an example please consider the following dataset:
id.in.group <- c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3)
group <- c(1,1,1,2,2,2,3,3,3,4,4,4,1,1,1,2,2,2,3,3,3,4,4,4,1,1,1,2,2,2,3,3,3,4,4,4)
trial <- c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3)
subject <- c("s7","s11","s3","s6","s9","s4","s12","s10","s1","s8","s2","s5","s5","s9","s6","s10","s1","s3","s4","s7","s2","s8","s12","s11","s5","s3","s9","s12","s11","s10","s1","s6","s7","s4","s2","s8")
df <- data.frame(group, id.in.group, trial, subject)
df$other1.id <- 0
df$other2.id <- 0
df$other1.id <- ifelse(df$id.in.group == "1" , 2, df$other1.id)
df$other2.id <- ifelse(df$id.in.group == "1" , 3, df$other2.id)
df$other1.id <- ifelse(df$id.in.group == "2" , 1, df$other1.id)
df$other2.id <- ifelse(df$id.in.group == "2" , 3, df$other2.id)
df$other1.id <- ifelse(df$id.in.group == "3" , 1, df$other1.id)
df$other2.id <- ifelse(df$id.in.group == "3" , 2, df$other2.id)
View(df)
Given the group number (df$group) and the id of the others in the group (df$other1.id and df$other2.id), I would like to create two further variables showing, for each trial and each subject, the value of the other 2 subjects rather than their relative id.in.group, so as to get the two following columns
df$other1.subject<-c("s11","s7","s7","s9","s6","s6","s10","s12","s12","s2","s8","s8","s9","s5","s5","s1","s10","s10","s7","s4","s4","s12","s8","s8", "s3","s5","s5","s11","s12","s12","s6","s1","s1","s2","s4","s4")
df$other2.subject<-c("s3","s3","s11","s4","s4","s9","s1","s1","s10","s5","s5","s2","s6","s6","s9","s3","s3","s1","s2","s2","s7","s11","s11","s12","s9","s9","s3","s10","s10","s11","s7","s7","s6","s8","s8","s2")
View(df)
For instance, if trial = 1 and id.in.group = 1 (or alternatively, subject = s7), then other1.subject = s11 while other2.subject = s3. I would like to extract such values for each id.in.group (or each subject) or for each row.
I apologize for not providing any previous attempt but, honestly, I have no clue how to tackle the problem. I remain open to any further clarification.
Many thanks for all your help!
You need to left join df with itself twice, once for other1 and once for other2:
library(dplyr)
df %>%
left_join(
df %>%
select(group, trial, other1.id = id.in.group, other1.subject = subject),
by = c("group", "trial", "other1.id")
) %>%
left_join(
df %>%
select(group, trial, other2.id = id.in.group, other2.subject = subject),
by = c("group", "trial", "other2.id")
)
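To see why the self-join works, it may help to run it on a single group and trial. In the sketch below, `small` is a hypothetical slice corresponding to group 1, trial 1 of the question's data, and `lookup` is an illustrative helper name. Each join renames the lookup's id column so that it matches other1.id or other2.id.

```r
library(dplyr)

# Hypothetical slice: group 1, trial 1 from the question
small <- data.frame(
  group = 1, trial = 1,
  id.in.group = c(1, 2, 3),
  subject = c("s7", "s11", "s3"),
  other1.id = c(2, 1, 1),
  other2.id = c(3, 3, 2),
  stringsAsFactors = FALSE
)

# Lookup table: which subject holds which id within each group and trial
lookup <- small %>% select(group, trial, id.in.group, subject)

joined <- small %>%
  left_join(rename(lookup, other1.id = id.in.group, other1.subject = subject),
            by = c("group", "trial", "other1.id")) %>%
  left_join(rename(lookup, other2.id = id.in.group, other2.subject = subject),
            by = c("group", "trial", "other2.id"))

joined[, c("subject", "other1.subject", "other2.subject")]
```

For s7 this yields s11 and s3, matching the example given in the question.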
I've created a few custom tweaks to zoo::na.locf before, but this one is driving me nuts. I need a function that will carry forward the last observation of a column only if the values in another column haven't changed; and it all has to be grouped by a primary key. For example:
library(dplyr)
set.seed(20180409)
data <- data.frame(Id = rep(1:10, each = 24),
Date = rep(seq.Date(as.Date("2016-01-01"), as.Date("2017-12-01"),
by = "month"), 10),
FillCol = replace(runif(240), runif(240) < 0.9, NA),
CheckCol = rep(letters[1:7], each = 7, length.out = 240))
data <- data %>%
group_by(Id) %>%
mutate(CheckColHasChanged = replace(lag(CheckCol) != CheckCol,
is.na(lag(CheckCol) != CheckCol), TRUE),
FillColIsNA = is.na(FillCol))
So I'm trying to carry forward any observations of FillCol, but once we hit an observation where CheckColHasChanged, stop the carry forward until the next valid observation in FillCol. I can do it in a loop but I'm struggling to do it properly.
Fill <- TRUE #indicator for whether or not I should be carrying forward
for(row in 2:nrow(data)){
#if the CheckCol has changed, don't fill
if(data$CheckColHasChanged[row]){Fill <- FALSE}
#if we should fill and still have the same Id, then fill from the last obs
if(Fill & data$Id[row] == data$Id[row - 1]){
data$FillCol[row] <- data$FillCol[row - 1]
}else{ #if there's a valid obs in FillCol, set the indicator back to true
if(!data$FillColIsNA[row]){Fill <- TRUE}
}
}
Any help would be greatly appreciated!
Comment to answer: this is just filling in by both Id and CheckCol:
data %>% group_by(Id, CheckCol) %>%
mutate(result = zoo::na.locf(FillCol, na.rm = FALSE))
The way you describe CheckCol, it is treated just like an ID. As long as a CheckCol value never comes back within an Id after it has changed, there's no difference between "only if the values in another column haven't changed" and "grouped by a primary key". You just have two columns to group by.
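If a CheckCol value can recur within an Id (for example a, then b, then a again), grouping by CheckCol directly would merge the two separate runs of a and carry values across the gap. One possible way around that, sketched here on made-up data, is to group by a run counter built with cumsum instead. The `toy` data frame and the `run` column are illustrative assumptions.

```r
library(dplyr)

# Hypothetical data in which CheckCol returns to "a" after changing to "b"
toy <- data.frame(
  Id       = 1,
  CheckCol = c("a", "a", "b", "a", "a"),
  FillCol  = c(10, NA, NA, NA, 5),
  stringsAsFactors = FALSE
)

filled <- toy %>%
  group_by(Id) %>%
  # run counter: goes up by one each time CheckCol differs from the previous row
  mutate(run = cumsum(CheckCol != lag(CheckCol, default = first(CheckCol)))) %>%
  group_by(Id, run) %>%
  # carry forward only within a run of identical CheckCol values
  mutate(result = zoo::na.locf(FillCol, na.rm = FALSE)) %>%
  ungroup()

filled$result
```

Here the fourth row stays NA because it belongs to a new run of a. Grouping by Id and CheckCol alone would have filled it with 10 from the first run.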
I've measured N2O flux from soil at multiple timepoints in the day (not equally spaced). I'm trying to calculate the total N2O flux from soil for a subset of days by finding the area under the curve for the given day. I know how to do this when using only measures from the given day; however, I'd like to include the last measure of the previous day and the first measure of the following day to improve the estimation of the curve.
Here's an example to give a more concrete idea:
library(MESS)
library(lubridate)
library(dplyr)
# Generate reproducible example
datetime <- seq(ymd_hm('2015-04-07 11:20'),ymd('2015-04-13'), by = 'hours')
dat <- data.frame(datetime, day = day(datetime), Flux = rnorm(n = length(datetime), mean = 400, sd = 20))
useDate <- data.frame(day = c(7:12), DateGood = c("No", "Yes", "Yes", "No", "Yes", "No"))
dat <- left_join(dat, useDate, by = "day")
Some days are "bad" (too many missing measures) and some are "Good" (usable). The goal is to filter all measurements (rows) that occurred on a "Good" day as well as the last measurement from the day before and the first measurement on the next day.
out <- dat %>%
mutate(lagDateGood = lag(DateGood),
leadDateGood = lead(DateGood)) %>%
filter(lagDateGood != "No" | leadDateGood != "No")
Now I need to calculate the area under the curve - this is not correct
out2 <- out %>%
group_by(day) %>%
mutate(hourOfday = hour(datetime) + minute(datetime)/60) %>%
summarize(auc = auc(x = hourOfday, y = Flux, from = 0, to = 24, type = "spline"))
The trouble is that I don't include the measurements at the end of the previous day and the start of the following day when calculating AUC. Also, I get an estimate of flux for day 10, which is a "bad" day.
I think the crux of my question has to do with groups. Some measurements need to be in multiple groups (for example the last measurement on day 8 would be used in estimating AUC for day 8 and day 9). Do you have suggestions for how I could form new groups? Or might there be a completely different way to achieve the goal?
For what it's worth, this is what I did. The answer really lies in the question I linked to in the comments. Starting with the dataframe "out" from the question:
#Now I need to calculate the area under the curve for each day
n <- nrow(out)
extract <- function(ix) out[seq(max(1, min(ix)-1), min(n, max(ix) + 1)), ]
res <- lapply(split(1:n, out$day), extract)
calcTotalFlux <- function(df) {
if (nrow(df) < 10) { # make sure the day has at least 10 measures
NA
} else {
day_midnight <- floor_date(df$datetime[2], "day")
df %>%
mutate(time = as.numeric(difftime(datetime, day_midnight, units = "mins"))) %>% # minutes since midnight, to match to = 1440
summarize(TotalFlux = auc(x = time, y = Flux, from = 0, to = 1440, type = "spline"))}
}
do.call("rbind",lapply(res, calcTotalFlux))
TotalFlux
7 NA
8 585230.2
9 579017.3
10 NA
11 563689.7
12 NA
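The `to = 1440` bound assumes time is measured in minutes since midnight. Subtracting two date-times lets R choose the unit from the data, so it is worth knowing when that choice would not be minutes. A small sketch with made-up timestamps (`midnight` and `times` are illustrative values):

```r
# Hypothetical timestamps: 1 h 20 min and 5 h after midnight (UTC)
midnight <- as.POSIXct("2015-04-08 00:00:00", tz = "UTC")
times <- as.POSIXct(c("2015-04-08 01:20:00", "2015-04-08 05:00:00"), tz = "UTC")

# Plain subtraction picks the unit automatically from the smallest difference;
# here every difference is at least one hour, so the unit is hours
auto <- times - midnight
units(auto)

# Asking for minutes explicitly gives plain numbers on a 0-1440 scale
mins <- as.numeric(difftime(times, midnight, units = "mins"))
mins
```

In the example data the measurements fall 20 minutes past each hour, so the automatic choice happens to be minutes. Requesting the unit explicitly removes that dependence on the data.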
Here's another way, more in line with the suggestions of @Alex Brown.
# Another way
last <- out %>%
group_by(day) %>%
filter(datetime == max(datetime)) %>%
ungroup() %>%
mutate(day = day + 1)
first <- out %>%
group_by(day) %>%
filter(datetime == min(datetime)) %>%
ungroup() %>%
mutate(day = day - 1)
d <- rbind(out, last, first) %>%
group_by(day) %>%
arrange(datetime)
n_measures_per_day <- d %>%
summarize(n = n())
d <- left_join(d, n_measures_per_day, by = "day") %>%
filter(n > 4)
TotalFluxDF <- d %>%
mutate(timeAtMidnight = floor_date(datetime[3], "day"),
time = as.numeric(difftime(datetime, timeAtMidnight, units = "mins"))) %>% # minutes since midnight
summarize(auc = auc(x = time, y = Flux, from = 0, to = 1440, type = "spline"))
TotalFluxDF
Source: local data frame [3 x 2]
day auc
(dbl) (dbl)
1 8 585230.2
2 9 579017.3
3 11 563689.7