na.locf only if another column hasn't changed - r

I've created a few custom tweaks to zoo::na.locf before, but this one is driving me nuts. I need a function that will carry forward the last observation of a column only if the values in another column haven't changed; and it all has to be grouped by a primary key. For example:
library(dplyr)
set.seed(20180409)
data <- data.frame(Id = rep(1:10, each = 24),
                   Date = rep(seq.Date(as.Date("2016-01-01"), as.Date("2017-12-01"),
                                       by = "month"), 10),
                   FillCol = replace(runif(240), runif(240) < 0.9, NA),
                   CheckCol = rep(letters[1:7], each = 7, length.out = 240))
data <- data %>%
  group_by(Id) %>%
  mutate(CheckColHasChanged = replace(lag(CheckCol) != CheckCol,
                                      is.na(lag(CheckCol) != CheckCol), TRUE),
         FillColIsNA = is.na(FillCol))
So I'm trying to carry forward any observation of FillCol, but once we hit a row where CheckColHasChanged is TRUE, stop the carry-forward until the next valid (non-NA) observation in FillCol. I can do it in a loop, but I'm struggling to do it properly without one.
Fill <- TRUE  # indicator for whether or not I should be carrying forward

for (row in 2:nrow(data)) {
  # if the CheckCol has changed, don't fill
  if (data$CheckColHasChanged[row]) {Fill <- FALSE}
  # if we should fill and still have the same Id, then fill from the last obs
  if (Fill & data$Id[row] == data$Id[row - 1]) {
    data$FillCol[row] <- data$FillCol[row - 1]
  } else { # if there's a valid obs in FillCol, set the indicator back to TRUE
    if (!data$FillColIsNA[row]) {Fill <- TRUE}
  }
}
Any help would be greatly appreciated!

A comment that answers the question: this is just filling in by both Id and CheckCol:
data %>%
  group_by(Id, CheckCol) %>%
  mutate(result = zoo::na.locf(FillCol, na.rm = FALSE))
The way you describe CheckCol, it is treated just like an ID. There's no difference between "only if the values in another column haven't changed" and "grouped by a primary key". You just have two columns to group by.
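That grouping works for this example, but if CheckCol could return to an earlier value within the same Id, grouping by the value itself would merge those non-consecutive runs into one fill group. A minimal sketch of a run-based variant, assuming the data prepared above (the run column is introduced here for illustration):
library(dplyr)

data %>%
  group_by(Id) %>%
  mutate(run = cumsum(CheckColHasChanged)) %>%  # a new run starts whenever CheckCol changes
  group_by(Id, run) %>%
  mutate(result = zoo::na.locf(FillCol, na.rm = FALSE)) %>%
  ungroup()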

Related

Conditionally calculating average time between events by group in R

I am working with a call log data set from a telephone hotline service. There are three call outcomes: Answered, Abandoned and Engaged. I am trying to find out the average time taken by each caller to contact the hotline again if they abandoned the previous call. The time difference can be expressed in seconds, minutes, hours or days, and I would like to get all four if possible.
Here is some mock data with the variables I am working with:
library(wakefield)  # for generating the CallOutcome variable
library(dplyr)
library(stringi)
library(Pareto)
library(uuid)

n_users <- 1300
n_rows <- 365000
set.seed(1)
#data<-data.frame()

Date <- seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by = "1 day")
Date <- sample(rep(Date, each = 1000), replace = TRUE)

u <- runif(length(Date), 0, 60 * 60 * 12)  # "noise" to add or subtract from some timepoint
CallDateTime <- as.POSIXlt(u, origin = paste(Date, "00:00:00"))
CallDateTime

CallOutcome <- r_sample_factor(x = c("Answered", "Abandoned", "Engaged"), n = length(Date))
CallOutcome

data <- data.frame(Date, CallDateTime, CallOutcome)

relative_probs <- rPareto(n = n_users, t = 1, alpha = 0.3, truncation = 500)
unique_ids <- UUIDgenerate(n = n_users)
data$CallerId <- sample(unique_ids, size = n_rows, prob = relative_probs, replace = TRUE)

data <- data %>% arrange(CallDateTime)
head(data)
So to reiterate, if a caller abandons their call (represented by "Abandoned" in the CallOutcome column), I would like to know the average time taken for the caller to make another call to the service, in the four time units I have mentioned. Any pointers on how I can achieve this would be great :)
For each caller, keep the pairs of rows where the current call is "Abandoned" and the next call is not. The time difference between the two rows of each pair is the time the caller took to call back after abandoning; averaging those differences gives the average time.
library(dplyr)

data %>%
  # Test the answer on a smaller subset first:
  # slice(1:1000) %>%
  arrange(CallerId, CallDateTime) %>%
  group_by(CallerId) %>%
  filter(CallOutcome == 'Abandoned' & dplyr::lead(CallOutcome) != 'Abandoned' |
           CallOutcome != 'Abandoned' & dplyr::lag(CallOutcome) == 'Abandoned') %>%
  mutate(group = rep(row_number(), each = 2, length.out = n())) %>%
  group_by(group, .add = TRUE) %>%
  summarise(avg_sec = difftime(CallDateTime[2], CallDateTime[1], units = 'secs')) %>%
  mutate(avg_sec = as.numeric(mean(avg_sec)),
         avg_min = avg_sec / 60,
         avg_hour = avg_min / 60,
         avg_day = avg_hour / 24) -> result

result
First, I would create a lead variable (basically, calculate the "next" call time within each group). Then it's just a matter of passing whatever unit you want to difftime. A density plot can help you analyze these differences, as shown below.
library(dplyr)
library(ggplot2)  # needed for the density plot at the end

data <-
  data %>%
  group_by(CallerId) %>%
  mutate(CallDateTime_Next = lead(CallDateTime)) %>%
  ungroup() %>%
  mutate(
    diff_days = difftime(CallDateTime_Next, CallDateTime, units = 'days'),
    diff_hours = difftime(CallDateTime_Next, CallDateTime, units = 'hours'),
    diff_mins = difftime(CallDateTime_Next, CallDateTime, units = 'mins'),
    diff_secs = difftime(CallDateTime_Next, CallDateTime, units = 'secs')
  )

data %>%
  filter(CallOutcome == 'Abandoned') %>%
  ggplot() +
  geom_density(aes(x = diff_days))
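If you want the actual averages in all four units rather than a plot, one option is to keep only the abandoned calls and summarise the differences per caller. This is a sketch reusing the diff_* columns created above; the output column names are illustrative:
data %>%
  filter(CallOutcome == 'Abandoned', !is.na(CallDateTime_Next)) %>%
  group_by(CallerId) %>%
  summarise(avg_secs  = mean(as.numeric(diff_secs)),
            avg_mins  = mean(as.numeric(diff_mins)),
            avg_hours = mean(as.numeric(diff_hours)),
            avg_days  = mean(as.numeric(diff_days)))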

Can I omit search results from a dataset in r?

My first work in databases was in FileMaker Pro. One of the features I really liked was the ability to do a complex search, and then with one call, omit those results and return anything from the original dataset that wasn't returned in the search. Is there a way to do this in R without having to flip all the logic in a search?
Something like:
everything_except <- df %>%
  filter(x == "something complex") %>%
  omit()
My initial thought was looking into using a join to keep non-matching values, but thought I would see if there's a different way.
Update with example:
I'm a little hesitant to add an example because I don't want to solve just this one problem; I'd rather understand whether there is an underlying method that covers multiple cases.
library(dplyr)

set.seed(123)
event_df <- tibble(time_sec = c(1:120)) %>%
  sample_n(100) %>%
  mutate(period = sample(c(1, 2, 3),
                         size = 100,
                         replace = TRUE),
         event = sample(c("A", "B"),
                        size = 100,
                        replace = TRUE,
                        prob = c(0.1, 0.9))) %>%
  select(period, time_sec, event) %>%
  arrange(period, time_sec)
filter_within_timeframe <- function(.data, condition, time, lead_time = 0, lag_time = 0) {
  condition <- enquo(condition)
  time <- enquo(time)
  filtered <- .data %>%
    slice(., 1:max(which(!!condition))) %>%
    group_by(., grp = lag(cumsum(!!condition), default = 0)) %>%
    filter(., (last(!!time) - !!time) <= lead_time &
             (last(!!time) - !!time) >= lag_time)
  return(filtered)
}
# this returns 23 rows of data. I would like to return everything except this data
event_df %>% filter_within_timeframe(event == "A", time_sec, 10, 0)
# final output should be 77 rows starting with...
# ~period, ~time_sec, ~event,
# 1,3,"B",
# 1,4,"B",
# 1,5,"B",
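One way to get the complement without inverting the filter logic is the join idea mentioned in the question: run the positive filter, then anti_join its result back against the original data. A sketch, assuming the combination of period, time_sec and event uniquely identifies a row (the extra grp column added by the helper is ignored because of the explicit by =):
library(dplyr)

matched <- event_df %>%
  filter_within_timeframe(event == "A", time_sec, 10, 0) %>%
  ungroup()

everything_except <- event_df %>%
  anti_join(matched, by = c("period", "time_sec", "event"))

nrow(everything_except)  # should be the 77 rows described above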

Making new variables for every group of observation in R

I have 11 variables in my dataframe. The first is the unique identifier of an observation (a plane). The second is a number from 1 to 21 representing a flight of a given plane. The rest of the variables are time, velocity, distance, etc.
What I want to do is make new variables for every flight number, e.g. time_1, time_2, ..., velocity_1, velocity_2, etc., and consequently reduce the number of observations (collapsing the repeated ones).
I don't really have idea how to start. I was thinking about a mutate function like:
mutate(df, time_1 = ifelse(n_flight == 1, time, NA))
But that would be a lot of typing, and new problems might appear along the way.
Basically, you want to convert long data to wide data for each variable. You can lapply over the variables with tidyr::spread in that case. Suppose the data looks like the following:
library(dplyr)
library(tidyr)

df <- data.frame(
  ID = c(rep("A", 3), rep("B", 3)),
  n_flight = rep(seq(3), 2),
  time = seq(19, 24),
  velocity = rev(seq(65, 60))
)
Then the following will generate your outcome of interest, as long as you get rid of the extra ID variables.
lapply(
  setdiff(names(df), c("ID", "n_flight")), function(x) {
    df %>%
      select(ID, n_flight, !!x) %>%
      tidyr::spread(., key = "n_flight", value = x) %>%
      setNames(paste(x, names(.), sep = "_"))
  }
) %>%
  bind_cols()
Let me know if this wasn't what you were going for.
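Since spread() has been superseded, the same reshape can also be done in a single call with tidyr::pivot_wider, using names_glue to build the time_1, velocity_1, ... names. A sketch on the same df:
library(tidyr)

df %>%
  pivot_wider(id_cols = ID,
              names_from = n_flight,
              values_from = c(time, velocity),
              names_glue = "{.value}_{n_flight}")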

Create column based on multiple conditions in r

I have a data frame with 3 columns: individual ID, trip (which is sequenced by ID), and forage (yes or no):
example <- data.frame(IDs = c(rep("A", 30), rep("B", 30)),
                      timestamp = seq(c(ISOdate(2016, 10, 01)), by = "day", length.out = 60),
                      trip = c(rep("1", 15), rep("2", 15)),
                      forage = c(rep("Yes", 3), rep("No", 5), rep("Yes", 3), rep("No", 4),
                                 rep("Yes", 7), rep("No", 8)))
I want to create two separate columns that will list foraging events for each observation. In the first column, I want to number each observation with foraging = "yes" within ID and trip (so, each trip within individual will have x number of foraging events, starting over again with "1" for the next trip within individual). This column would look like:
example$forageEvent1 <- c(rep(1, 3), rep(NA, 5), rep(2, 3), rep(NA, 4), rep(1, 7), rep(NA, 8),
                          rep(1, 3), rep(NA, 5), rep(2, 3), rep(NA, 4), rep(1, 7), rep(NA, 8))
The second column will number the foraging events by ID only:
example$forageEvent2 <- c(rep(1, 3), rep(NA, 5), rep(2, 3), rep(NA, 4), rep(3, 7), rep(NA, 8),
                          rep(1, 3), rep(NA, 5), rep(2, 3), rep(NA, 4), rep(3, 7), rep(NA, 8))
I can subset/pipe down to individual and trip and have tried ifelse(), but have no idea how to write code that will create a sequence of events. Thanks all.
EDIT: the code below, edited from a comment, gets close. However, it prints starting with "Forage0" instead of "Forage1".
library(dplyr)

Test_example <- example %>%
  group_by(IDs) %>%
  mutate(
    ForagebyID = case_when(
      forage == "Yes" ~ "Forage",
      forage == "No" ~ "NonForage"),
    rleid = cumsum(ForagebyID != lag(ForagebyID, 1, default = "NA")),
    ForagebyID = case_when(
      ForagebyID == "Forage" ~ paste0(ForagebyID, rleid %/% 2),
      TRUE ~ "NonForage"),
    rleid = NULL
  )
I think this will do what you want:
library(dplyr)

example <- data.frame(IDs = c(rep("A", 30), rep("B", 30)),
                      timestamp = seq(c(ISOdate(2016, 10, 01)), by = "day", length.out = 60),
                      trip = c(rep("1", 15), rep("2", 15)),
                      forage = c(rep("Yes", 3), rep("No", 5), rep("Yes", 3), rep("No", 4),
                                 rep("Yes", 7), rep("No", 8)))

Test_example <- example %>%
  arrange(IDs, timestamp) %>%
  group_by(IDs, trip) %>%
  mutate(forageEvent1 = case_when(forage == "No" ~ 0,
                                  TRUE ~ cumsum(forage != lag(forage, default = "1")) %/% 2 + 1)) %>%
  group_by(IDs) %>%
  mutate(forageEvent2 = case_when(forage == "No" ~ 0,
                                  TRUE ~ cumsum(forage != lag(forage, default = "1")) %/% 2 + 1))
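Note that this numbers the non-foraging rows as 0 rather than NA as in your expected output; if you prefer NA there, swap the first branch of each case_when, e.g.:
mutate(forageEvent1 = case_when(forage == "No" ~ NA_real_,
                                TRUE ~ cumsum(forage != lag(forage, default = "1")) %/% 2 + 1))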

Duplicated for specific column after grouping - Speed Issue

I have working code below, which does what I am after, and it runs fine on a test subset of roughly 1,000 records. However, the actual dataset has about half a million rows, and there the code suddenly takes over five minutes. Could anyone tell me why, or how to improve the code?
The end result I need is to keep only the first value of duplicated IDs, but this should reset for each year (i.e. duplicate values are fine if they are in different years, but not within the same year).
Test %>%
  group_by(year, id) %>%
  mutate(is_duplicate = duplicated(id)) %>%
  mutate(oppervlakt = ifelse(is_duplicate == FALSE, oppervlakt, 0)) %>%
  select(-is_duplicate)
I think you could remove id from the grouping and should get the same results. See this example:
library(dplyr)

# some sample data:
n_rows <- 1E6
df <- data.frame(year = sample(x = c(2000:2018), size = n_rows, replace = TRUE),
                 id = sample(x = seq_len(1000), size = n_rows, replace = TRUE),
                 oppervlakt = rnorm(n = n_rows))

# Roughly 1 second:
system.time(df_slow <- df %>%
              group_by(year, id) %>%
              mutate(oppervlakt = ifelse(duplicated(id), 0, oppervlakt)))

# Roughly 0.1 seconds:
system.time(df_fast <- df %>%
              group_by(year) %>%
              mutate(oppervlakt = ifelse(duplicated(id), 0, oppervlakt)))

all.equal(df_slow, df_fast)
[1] TRUE
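If even the single grouping proves slow on the real data, another variant worth benchmarking (not from the original answer) is to drop group_by() entirely and let duplicated() work on the year/id pair directly:
df_fastest <- df %>%
  mutate(oppervlakt = ifelse(duplicated(data.frame(year, id)), 0, oppervlakt))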
