Duplicated values for a specific column after grouping - speed issue (R)

I have working code below which does what I am after, and it runs fine on a test subset of roughly 1,000 records. However, the actual dataset has about half a million rows, and there the code suddenly takes over five minutes. Could anyone tell me why, or how to improve the code?
The end result I need is to keep only the first value of duplicated IDs, reset for each year (i.e. duplicate values are fine if they are in different years, but not in the same year).
Test %>%
  group_by(year, id) %>%
  mutate(is_duplicate = duplicated(id)) %>%
  mutate(oppervlakt = ifelse(is_duplicate == FALSE, oppervlakt, 0)) %>%
  select(-is_duplicate)

I think you can remove id from the grouping and get the same results: within a group_by(year, id) group every row shares the same id, so duplicated(id) already flags every row after the first, and grouping by year alone produces exactly the same flags while giving dplyr far fewer groups to iterate over. See this example:
library(dplyr)

# some sample data:
n_rows <- 1E6
df <- data.frame(year = sample(x = 2000:2018, size = n_rows, replace = TRUE),
                 id = sample(x = seq_len(1000), size = n_rows, replace = TRUE),
                 oppervlakt = rnorm(n = n_rows))

# Roughly 1 second:
system.time(df_slow <- df %>%
              group_by(year, id) %>%
              mutate(oppervlakt = ifelse(duplicated(id), 0, oppervlakt)))

# Roughly 0.1 seconds:
system.time(df_fast <- df %>%
              group_by(year) %>%
              mutate(oppervlakt = ifelse(duplicated(id), 0, oppervlakt)))

all.equal(df_slow, df_fast)
[1] TRUE
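If you want to avoid grouping altogether, here is a minimal ungrouped sketch (reusing the sample df above): duplicated() on a two-column data frame flags repeats of the whole (year, id) pair, which expresses the same "first per id per year" rule without any groups.
library(dplyr)

# duplicated() on the (year, id) pair marks every repeat of an id within
# the same year across the whole data frame, so no group_by() is needed:
df_ungrouped <- df %>%
  mutate(oppervlakt = ifelse(duplicated(data.frame(year, id)), 0, oppervlakt))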

Related

Conditionally calculating average time between events by group in R

I am working with a call log data set from a telephone hotline service. There are three call outcomes: Answered, Abandoned and Engaged. I am trying to find out the average time taken by each caller to contact the hotline again if they abandoned the previous call. The time difference can be expressed in seconds, minutes, hours or days, and I would like to get all four if possible.
Here is some mock data with the variables I am working with:
library(wakefield) # for generating the Status variable
library(dplyr)
library(stringi)
library(Pareto)
library(uuid)

n_users <- 1300
n_rows <- 365000
set.seed(1)

Date <- seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by = "1 day")
Date <- sample(rep(Date, each = 1000), replace = TRUE)
u <- runif(length(Date), 0, 60*60*12) # "noise" to add or subtract from some timepoint
CallDateTime <- as.POSIXlt(u, origin = paste0(Date, " 00:00:00")) # space added so the origin parses as a full datetime
CallDateTime
CallOutcome <- r_sample_factor(x = c("Answered", "Abandoned", "Engaged"), n = length(Date))
CallOutcome
data <- data.frame(Date, CallDateTime, CallOutcome)
relative_probs <- rPareto(n = n_users, t = 1, alpha = 0.3, truncation = 500)
unique_ids <- UUIDgenerate(n = n_users)
data$CallerId <- sample(unique_ids, size = n_rows, prob = relative_probs, replace = TRUE)
data <- data %>% arrange(CallDateTime)
head(data)
So to reiterate, if a caller abandons their call (represented by "Abandoned" in the CallOutcome column), I would like to know the average time taken for the caller to make another call to the service, in the four time units I have mentioned. Any pointers on how I can achieve this would be great :)
Keep the rows where, for each caller, the current row is "Abandoned" and the next row is not "Abandoned". The difference in time between every two such rows is the time the caller needed to call the service again after abandoning; averaging those durations gives the average time.
library(dplyr)

data %>%
  # Test the answer on a smaller subset first:
  # slice(1:1000) %>%
  arrange(CallerId, CallDateTime) %>%
  group_by(CallerId) %>%
  # keep abandoned calls together with the call that follows them:
  filter(CallOutcome == 'Abandoned' & dplyr::lead(CallOutcome) != 'Abandoned' |
         CallOutcome != 'Abandoned' & dplyr::lag(CallOutcome) == 'Abandoned') %>%
  # pair each abandoned call with its follow-up call:
  mutate(group = rep(row_number(), each = 2, length.out = n())) %>%
  group_by(group, .add = TRUE) %>%
  summarise(avg_sec = difftime(CallDateTime[2], CallDateTime[1], units = 'secs')) %>%
  mutate(avg_sec = as.numeric(mean(avg_sec)),
         avg_min = avg_sec/60,
         avg_hour = avg_min/60,
         avg_day = avg_hour/24) -> result
result
First, I would create the lead variable (basically calculating what the "next" value is, by group). Then it's just as easy as using whatever unit you want for difftime. A density plot can help you analyze these differences, as shown below.
data <-
  data %>%
  group_by(CallerId) %>%
  mutate(CallDateTime_Next = lead(CallDateTime)) %>%
  ungroup() %>%
  mutate(
    diff_days = difftime(CallDateTime_Next, CallDateTime, units = 'days'),
    diff_hours = difftime(CallDateTime_Next, CallDateTime, units = 'hours'),
    diff_mins = difftime(CallDateTime_Next, CallDateTime, units = 'mins'),
    diff_secs = difftime(CallDateTime_Next, CallDateTime, units = 'secs')
  )

library(ggplot2)
data %>%
  filter(CallOutcome == 'Abandoned') %>%
  ggplot() +
  geom_density(aes(x = diff_days))
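To get the actual averages the question asks for, here is a minimal follow-up sketch reusing the diff_* columns created above (mean() on a difftime keeps its units):
# average time until the next call after an abandoned call, in all four units:
data %>%
  filter(CallOutcome == 'Abandoned') %>%
  summarise(avg_days = mean(diff_days, na.rm = TRUE),
            avg_hours = mean(diff_hours, na.rm = TRUE),
            avg_mins = mean(diff_mins, na.rm = TRUE),
            avg_secs = mean(diff_secs, na.rm = TRUE))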

Can I omit search results from a dataset in r?

My first work in databases was in FileMaker Pro. One of the features I really liked was the ability to do a complex search, and then with one call, omit those results and return anything from the original dataset that wasn't returned in the search. Is there a way to do this in R without having to flip all the logic in a search?
Something like:
everything_except <- df %>%
  filter(x == "something complex") %>%
  omit()
My initial thought was looking into using a join to keep non-matching values, but thought I would see if there's a different way.
Update with example:
I'm a little hesitant to add an example because I don't want to solve just this one problem; I'd rather understand whether there is an underlying method that works for multiple cases.
library(dplyr)

set.seed(123)
event_df <- tibble(time_sec = c(1:120)) %>%
  sample_n(100) %>%
  mutate(period = sample(c(1, 2, 3),
                         size = 100,
                         replace = TRUE),
         event = sample(c("A", "B"),
                        size = 100,
                        replace = TRUE,
                        prob = c(0.1, 0.9))) %>%
  select(period, time_sec, event) %>%
  arrange(period, time_sec)

filter_within_timeframe <- function(.data, condition, time, lead_time = 0, lag_time = 0) {
  condition <- enquo(condition)
  time <- enquo(time)
  filtered <- .data %>%
    slice(., 1:max(which(!!condition))) %>%
    group_by(., grp = lag(cumsum(!!condition), default = 0)) %>%
    filter(., (last(!!time) - !!time) <= lead_time &
              (last(!!time) - !!time) >= lag_time)
  return(filtered)
}

# this returns 23 rows of data. I would like to return everything except this data
event_df %>% filter_within_timeframe(event == "A", time_sec, 10, 0)
# final output should be 77 rows starting with...
# ~period, ~time_sec, ~event,
# 1,3,"B",
# 1,4,"B",
# 1,5,"B",

How Can I Turn 120/80 into Two Columns (120 and 80)?

I have a column of blood pressures which reads as ###/##; all I want to do is split the numerator into one column and the denominator into another column.
Please help?
library(dplyr)
library(stringr)

df = data.frame(
  first_bp = c("120/80", "90/60"),
  id = c("0001234", "0001235"),
  amount = c(18.50, -18.50), stringsAsFactors = F)

df %>%
  mutate(s0 = str_split(first_bp, "/")) %>% # list column: c(systole, diastole) per row
  rowwise() %>%
  mutate(systole = as.numeric(s0[1]),
         diastole = as.numeric(s0[2])) %>%
  select(first_bp, id, amount, systole, diastole)
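A more compact alternative sketch, assuming tidyr is available: separate() splits the column on "/" and convert = TRUE coerces the resulting pieces to numeric.
library(tidyr)

# split first_bp into two numeric columns, keeping the original column:
df %>%
  separate(first_bp, into = c("systole", "diastole"),
           sep = "/", remove = FALSE, convert = TRUE)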
In Python you can do it with the .split() method (the base-R equivalent is strsplit()).
Here is an example:
blood_pressure = '120/80'
blood_pressure = blood_pressure.split('/')
numerator = blood_pressure[0]
denominator = blood_pressure[1]
print(numerator, denominator)
Output:
120 80

Understanding dplyr piping and summarizing function

I'm looking for some help understanding piping and summarizing functions using dplyr. I feel like my coding is a bit verbose and could be simplified. There are a couple of questions in here because I know I'm missing some concepts, but I'm not quite sure where that lack of knowledge is. I've included my full code at the bottom. Thanks in advance, as this is a bit larger ask.
1a. From the example data below and using dplyr, is there a way to calculate the games (dates) per team without using an intermediate table?
1b. I've included my original way to calculate n_games, which didn't work. Why?
set.seed(123)
shot_df_ex <- tibble(Team_Name = sample(LETTERS[1:5], 250, replace = TRUE),
                     Date = sample(as.Date(c("2019-08-01",
                                             "2019-09-01",
                                             "2018-08-01",
                                             "2018-09-01",
                                             "2017-08-01",
                                             "2017-09-01")),
                                   size = 250, replace = TRUE),
                     Type = sample(c("shot", "goal"), size = 250,
                                   replace = TRUE, prob = c(0.9, 0.1)))
# count shots per team per game (date)
n_shots_per_game <- shot_df_ex %>%
  count(Team_Name, Date)
n_shots_per_game

# count games (dates) per team [ISSUES!!!]
# is there a way to do this piping from the shot_df_ex tibble instead of
# using an intermediate tibble?
# count number of games using the tibble created above [DOES NOT WORK -- WHY?]
n_games <- n_shots_per_game %>%
  count(Team_Name)
n_games # what is this counting? It should be 6 for each.

# this works, but isn't count() just a quicker way to run
# group_by() %>% summarise()?
n_games <- n_shots_per_game %>%
  group_by(Team_Name) %>%
  summarise(N_Games = n())
n_games
Below is my process of creating a summary table. I understand that piping is meant to cut out the creation of some intermediate variables/tables. Where could I combine steps below to create the final table with a minimum number of intermediate steps?
# load libraries ------------------------------------------------
library(tidyverse)

# build sample shot data ---------------------------------------
set.seed(123)
shot_df_ex <- tibble(Team_Name = sample(LETTERS[1:5], 250, replace = TRUE),
                     Date = sample(as.Date(c("2019-08-01",
                                             "2019-09-01",
                                             "2018-08-01",
                                             "2018-09-01",
                                             "2017-08-01",
                                             "2017-09-01")),
                                   size = 250, replace = TRUE),
                     Type = sample(c("shot", "goal"), size = 250,
                                   replace = TRUE, prob = c(0.9, 0.1)))

# calculate data ----------------------------------------------
# since every row is a shot, the following function counts shots for ea. team
n_shots <- shot_df_ex %>%
  count(Team_Name) %>%
  rename(N_Shots = n)
n_shots

# do the same for goals for each team
n_goals <- shot_df_ex %>%
  filter(Type == "goal") %>%
  count(Team_Name, sort = T) %>%
  rename(N_Goals = n) %>%
  arrange(Team_Name)
n_goals

# count shots per team per game (date)
n_shots_per_game <- shot_df_ex %>%
  count(Team_Name, Date)
n_shots_per_game

# count games (dates) per team [ISSUES!!!]
# is there a way to do this piping from the shot_df_ex tibble instead of
# using an intermediate tibble?
# count number of games using the tibble created above [DOES NOT WORK]
n_games <- n_shots_per_game %>%
  count(Team_Name)
n_games # what is this counting? It should be 6 for each.

# this works, but isn't count() just a quicker way to run
# group_by() %>% summarise()?
n_games <- n_shots_per_game %>%
  group_by(Team_Name) %>%
  summarise(N_Games = n())
n_games

# combine data ------------------------------------------------
# combine columns and add average shots per game
shot_table_ex <- n_games %>%
  left_join(n_shots) %>%
  left_join(n_goals)

# final table with final average calculations
shot_table_ex <- shot_table_ex %>%
  mutate(Shots_per_Game = round(N_Shots / N_Games, 1),
         Goals_per_Game = round(N_Goals / N_Games, 1)) %>%
  arrange(Team_Name)
shot_table_ex
For 1a, you can just pipe straight from the tibble() call into count(), i.e.:
tibble(Team_Name = sample(LETTERS[1:5], 250, replace = TRUE),
       Date = sample(as.Date(c("2019-08-01",
                               "2019-09-01",
                               "2018-08-01",
                               "2018-09-01",
                               "2017-08-01",
                               "2017-09-01")),
                     size = 250, replace = TRUE),
       Type = sample(c("shot", "goal"), size = 250,
                     replace = TRUE, prob = c(0.9, 0.1))) %>%
  count(Team_Name, Date)
In 1b, count() is using your column n (i.e. the number of shots) as a weighting variable, so it is summing the total number of shots per team, not counting the number of rows. It prints a message telling you this:
Using `n` as weighting variable
ℹ Quiet this message with `wt = n` or count rows with `wt = 1`
Using count(Team_Name, wt = n()) will give the behaviour you want.
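For example, a quick sketch against the n_shots_per_game table from the question:
# wt = n() forces count() to count rows per team instead of summing the n column:
n_games <- n_shots_per_game %>%
  count(Team_Name, wt = n())
n_games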
Edit: part 2
shot_table_ex <- tibble(Team_Name = sample(LETTERS[1:5], 250, replace = TRUE),
                        Date = sample(as.Date(c("2019-08-01",
                                                "2019-09-01",
                                                "2018-08-01",
                                                "2018-09-01",
                                                "2017-08-01",
                                                "2017-09-01")),
                                      size = 250, replace = TRUE),
                        Type = sample(c("shot", "goal"), size = 250,
                                      replace = TRUE, prob = c(0.9, 0.1))) %>%
  group_by(Team_Name) %>%
  summarise(n_shots = n(),
            n_goals = sum(Type == "goal"),
            n_games = n_distinct(Date)) %>%
  mutate(Shots_per_Game = round(n_shots / n_games, 1),
         Goals_per_Game = round(n_goals / n_games, 1))
1a. From the example data below and using dplyr, is there a way to calculate the games(dates) per team without using an intermediate table?
This is how I would do it:
shot_df_ex %>%
  distinct(Team_Name, Date) %>% # keeps only the given columns and one of each combo
  count(Team_Name)
You can also use unique:
shot_df_ex %>%
  group_by(Team_Name) %>%
  summarize(N_Games = length(unique(Date)))
1b. I've included my original way to calculate n_games which didn't work. Why?
Your code is working for me. Did you perhaps save over the intermediate table? It's counting the expected 6 per team.
Below is my process of creating a summary table. I understand that piping is meant to cut out the creation of some intermediate variables/tables. Where could I combine steps below to create the final table with a minimum number of intermediate steps?
shot_df_ex %>%
  group_by(Team_Name) %>%
  summarize(
    N_Games = length(unique(Date)),
    N_Shots = sum(Type == "shot"),
    N_Goals = sum(Type == "goal")
  ) %>%
  mutate(Shots_per_Game = round(N_Shots / N_Games, 1),
         Goals_per_Game = round(N_Goals / N_Games, 1))
You can compute multiple summaries in a single summarize step as long as you don't need to change your grouping. In the sum calls we're taking advantage of the fact that TRUE counts as 1 and FALSE as 0. length gives us the length of the vector produced by unique.
this (count) works, but isn't count() just a quicker way to run group_by() %>% summarise()?
count() is just a combination of group_by(col) %>% tally(), and tally() is essentially summarize(n = n()), so yes. :)
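A quick sketch of that equivalence using the shot_df_ex data from the question; all three lines produce the same per-team counts:
shot_df_ex %>% count(Team_Name)
shot_df_ex %>% group_by(Team_Name) %>% tally()
shot_df_ex %>% group_by(Team_Name) %>% summarise(n = n())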

Using replace_na for multiple data subsets

I'm trying to replace the NAs in multiple column variables with randomly generated values based on each student_id's own subset of rows:
data snapshot
So for student 3, systolic needs two NAs replaced. I used the min and max values of each variable within the student 3 subset to generate random values.
library(dplyr)
library(tidyr)
library(tibble)
library(tidyverse)

dplyr::filter(exercise, student_id == "3") %>%
  replace_na(list(systolic = round(sample(runif(1000, 125, 130), 2), 0),
                  diastolic = round(sample(runif(1000, 85, 85), 3), 0),
                  heart_rate = round(sample(runif(1000, 79, 86), 2), 0),
                  phys_score = round(sample(runif(1000, 8, 9), 2), 0)))
However, it works only when a single NA needs replacing (it successfully replaced one systolic NA value). When I try to replace more than one NA, this error comes up:
Error: Replacement for `systolic` is length 2, not length 1
Is there a way to fix this? I tried converting the column variables to data frames instead of the vectors they are now, but it only returned the original data without any replacement changes.
Are there any simpler ways to do this? Any suggestions/comments would be appreciated. Thanks.
A solution that makes things a little more automated, but may be unnecessarily complex. First, generate some grouped missing data from the mtcars dataset:
library(magrittr)
library(purrr)
library(dplyr)
library(stringr)
library(tidyr)

## Generate some missing data within a subset of car makes
mtcars_miss <- mtcars %>%
  as_tibble(rownames = "car") %>%
  select(car) %>%
  separate(car, c("make", "name"), " ") %>%
  bind_cols(mtcars[, -1] %>%
              map_df(~ .[sample(c(TRUE, NA), prob = c(0.8, 0.2),
                                size = length(.), replace = TRUE)])) %>%
  filter(make %in% c("Mazda", "Hornet", "Merc"))
A function to replace NA values in a given variable by sampling between that variable's min and max within some group (here make):
replace_na_sample <- function(df_miss, var, group = "make") {
  var <- enquo(var)
  df_miss %>%
    group_by(.dots = group) %>%
    # draw a candidate replacement for every row, uniform between the
    # group-wise min and max of the variable:
    mutate(replace_var := round(runif(n(), min(!!var, na.rm = T),
                                      max(!!var, na.rm = T)), 0)) %>%
    rowwise %>%
    # only the NA entries actually get replaced:
    mutate_at(.vars = vars(!!var),
              .funs = funs(replace_na(., replace_var))) %>%
    select(-replace_var) %>%
    ungroup
}
Example: replacing several missing values in multiple columns.
mtcars_replaced <- mtcars_miss %>%
  replace_na_sample(cyl, group = "make") %>%
  replace_na_sample(disp, group = "make") %>%
  replace_na_sample(hp, group = "make")
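As a quick sanity check, a sketch (assuming dplyr >= 1.0 for across()) confirming the targeted columns no longer contain NAs:
# count remaining NAs per replaced column; all should be 0:
mtcars_replaced %>%
  summarise(across(c(cyl, disp, hp), ~ sum(is.na(.))))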
