Understanding dplyr piping and summarizing functions - r

I'm looking for some help understanding piping and summarizing functions using dplyr. I feel like my coding is a bit verbose and could be simplified. There are a couple of questions in here because I know I'm missing some concepts, but I'm not quite sure where that lack of knowledge is. I've included my full code at the bottom. Thanks in advance, as this is a bit of a larger ask.
1a. From the example data below and using dplyr, is there a way to calculate the games(dates) per team without using an intermediate table?
1b. I've included my original way to calculate n_games which didn't work. Why?
set.seed(123)
shot_df_ex <- tibble(Team_Name = sample(LETTERS[1:5], 250, replace = TRUE),
                     Date = sample(as.Date(c("2019-08-01",
                                             "2019-09-01",
                                             "2018-08-01",
                                             "2018-09-01",
                                             "2017-08-01",
                                             "2017-09-01")),
                                   size = 250, replace = TRUE),
                     Type = sample(c("shot", "goal"), size = 250,
                                   replace = TRUE, prob = c(0.9, 0.1)))
# count shots per team per game (date)
n_shots_per_game <- shot_df_ex %>%
  count(Team_Name, Date)
n_shots_per_game
# count games (dates) per team [ISSUES!!!]
# is there a way to do this piping from the shot_df_ex tibble instead of
# using an intermediate tibble?
# count number of games using the tibble created above [DOES NOT WORK--WHY?]
n_games <- n_shots_per_game %>%
  count(Team_Name)
n_games # what is this counting? It should be 6 for each.
# this works, but isn't count() just a quicker way to run
# group_by() %>% summarise()?
n_games <- n_shots_per_game %>%
  group_by(Team_Name) %>%
  summarise(N_Games = n())
n_games
Below is my process of creating a summary table. I understand that piping is meant to cut out the creation of some intermediate variables/tables. Where could I combine steps below to create the final table with the minimum number of intermediate steps?
# load libraries ------------------------------------------------
library(tidyverse)
# build sample shot data ---------------------------------------
set.seed(123)
shot_df_ex <- tibble(Team_Name = sample(LETTERS[1:5], 250, replace = TRUE),
                     Date = sample(as.Date(c("2019-08-01",
                                             "2019-09-01",
                                             "2018-08-01",
                                             "2018-09-01",
                                             "2017-08-01",
                                             "2017-09-01")),
                                   size = 250, replace = TRUE),
                     Type = sample(c("shot", "goal"), size = 250,
                                   replace = TRUE, prob = c(0.9, 0.1)))
# calculate data ----------------------------------------------
# since every row is a shot, the following function counts shots for ea. team
n_shots <- shot_df_ex %>%
  count(Team_Name) %>%
  rename(N_Shots = n)
n_shots
# do the same for goals for each team
n_goals <- shot_df_ex %>%
  filter(Type == "goal") %>%
  count(Team_Name, sort = TRUE) %>%
  rename(N_Goals = n) %>%
  arrange(Team_Name)
n_goals
# count shots per team per game (date)
n_shots_per_game <- shot_df_ex %>%
  count(Team_Name, Date)
n_shots_per_game
# count games (dates) per team [ISSUES!!!]
# is there a way to do this piping from the shot_df_ex tibble instead of
# using an intermediate tibble?
# count number of games using the tibble created above [DOES NOT WORK]
n_games <- n_shots_per_game %>%
  count(Team_Name)
n_games # what is this counting? It should be 6 for each.
# this works, but isn't count() just a quicker way to run
# group_by() %>% summarise()?
n_games <- n_shots_per_game %>%
  group_by(Team_Name) %>%
  summarise(N_Games = n())
n_games
# combine data ------------------------------------------------
# combine columns and add average shots per game
shot_table_ex <- n_games %>%
  left_join(n_shots) %>%
  left_join(n_goals)
# final table with final average calculations
shot_table_ex <- shot_table_ex %>%
  mutate(Shots_per_Game = round(N_Shots / N_Games, 1),
         Goals_per_Game = round(N_Goals / N_Games, 1)) %>%
  arrange(Team_Name)
shot_table_ex

For 1a, you can just pipe straight from the tibble() call to count(), i.e.:
tibble(Team_Name = sample(LETTERS[1:5], 250, replace = TRUE),
       Date = sample(as.Date(c("2019-08-01",
                               "2019-09-01",
                               "2018-08-01",
                               "2018-09-01",
                               "2017-08-01",
                               "2017-09-01")),
                     size = 250, replace = TRUE),
       Type = sample(c("shot", "goal"), size = 250,
                     replace = TRUE, prob = c(0.9, 0.1))) %>%
  count(Team_Name, Date)
In 1b, count() is using your column n (i.e. the number of shots) as a weighting variable, so it is summing the total number of shots per team rather than counting rows. It prints a message telling you this:
Using `n` as weighting variable
ℹ Quiet this message with `wt = n` or count rows with `wt = 1`
Using count(Team_Name, wt=n()) will give the behaviour you want.
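A sketch of the full pipe with that fix, so no intermediate tibble is needed:
shot_df_ex %>%
  count(Team_Name, Date) %>%   # shots per team per game (date)
  count(Team_Name, wt = n())   # games per team: counts rows instead of summing the n column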
Edit: part 2
shot_table_ex <- tibble(Team_Name = sample(LETTERS[1:5], 250, replace = TRUE),
                        Date = sample(as.Date(c("2019-08-01",
                                                "2019-09-01",
                                                "2018-08-01",
                                                "2018-09-01",
                                                "2017-08-01",
                                                "2017-09-01")),
                                      size = 250, replace = TRUE),
                        Type = sample(c("shot", "goal"), size = 250,
                                      replace = TRUE, prob = c(0.9, 0.1))) %>%
  group_by(Team_Name) %>%
  summarise(n_shots = n(),
            n_goals = sum(Type == "goal"),
            n_games = n_distinct(Date)) %>%
  mutate(Shots_per_Game = round(n_shots / n_games, 1),
         Goals_per_Game = round(n_goals / n_games, 1))

1a. From the example data below and using dplyr, is there a way to calculate the games(dates) per team without using an intermediate table?
This is how I would do it:
shot_df_ex %>%
  distinct(Team_Name, Date) %>% # Keeps only the cols given and one of each combo
  count(Team_Name)
You can also use unique:
shot_df_ex %>%
  group_by(Team_Name) %>%
  summarize(N_Games = length(unique(Date)))
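dplyr's n_distinct() is shorthand for length(unique(x)), so an equivalent form is:
shot_df_ex %>%
  group_by(Team_Name) %>%
  summarize(N_Games = n_distinct(Date))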
1b. I've included my original way to calculate n_games which didn't
work. Why?
Your code is working for me. Did you perhaps save over the intermediate table? It's counting the expected 6 per team.
Below is my process of creating a summary table. I understand that piping is meant to cut out the creation of some intermediate
variables/tables. Where could I combine steps below to create the
final table with a minimum number of intermediate steps?
shot_df_ex %>%
  group_by(Team_Name) %>%
  summarize(
    N_Games = length(unique(Date)),
    N_Shots = sum(Type == "shot"),
    N_Goals = sum(Type == "goal")
  ) %>%
  mutate(Shots_per_Game = round(N_Shots / N_Games, 1),
         Goals_per_Game = round(N_Goals / N_Games, 1))
You can compute multiple summary columns in a single summarize call as long as you don't need to change your grouping. We're taking advantage here (in the sum calls) of the interpretation of TRUE as 1 and FALSE as 0. length will of course give us the length of the vector produced by unique.
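A quick illustration of both ideas:
x <- c("shot", "goal", "shot", "goal", "shot")
sum(x == "goal")   # the comparison gives TRUE/FALSE, which sum() treats as 1/0, so this returns 2
length(unique(x))  # 2 distinct values: "shot" and "goal"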
this (count) works, but isn't count() just a quicker way to run group_by() %>% summarise()?
count is just a combination of group_by(col) %>% tally(), and tally is essentially summarize(n = n()), so yes. :)
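In other words, these three should produce the same table:
shot_df_ex %>% count(Team_Name)
shot_df_ex %>% group_by(Team_Name) %>% tally()
shot_df_ex %>% group_by(Team_Name) %>% summarise(n = n())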

Related

Aggregation and mean calculation with dplyr

I have a chunk of code that aggregates timestamps of a large dataset (see below). Each timestamp represents a tweet. The code aggregates the tweets per week, and it works fine. Now, I also have a column with the sentiment value of each tweet. I would like to know if it is possible to calculate the mean sentiment of the tweets per week. It would be nice to end up with one dataset containing the number of tweets per week and the mean sentiment of those aggregated tweets. Please let me know if you've got some hints :)
Kind regards,
Daniel
weekly_counts_2 <- df_bw %>%
  drop_na(Timestamp) %>%
  mutate(weekly_cases = floor_date(Timestamp, unit = "week")) %>%
  count(weekly_cases) %>%
  tidyr::complete(weekly_cases = seq.Date(from = min(weekly_cases),
                                          to = max(weekly_cases),
                                          by = "week"),
                  fill = list(n = 0))
It is difficult to verify the answer since no data has been shared, but based on the description provided, here is a solution that you can try.
library(dplyr)
library(tidyr)
library(lubridate)
weekly_counts_2 <- df_bw %>%
  drop_na(Timestamp) %>%
  mutate(weekly_cases = floor_date(Timestamp, unit = "week")) %>%
  group_by(weekly_cases) %>%
  summarise(mean_sentiment = mean(sentiment_value, na.rm = TRUE),
            count = n()) %>%
  complete(weekly_cases = seq.Date(min(weekly_cases),
                                   max(weekly_cases), by = "week"),
           fill = list(count = 0))
I have assumed the column with the sentiment value is called sentiment_value; change it according to your data.

How can I compute descriptive statistics within variable levels for which I have already generated counts?

I am trying to obtain detailed descriptive statistics. I have selected three variables (Baseline, Experience, and Engagement_overall) from the main data frame Users and have calculated sample sizes for each. Here is one example using this code:
Engagement <- Users %>%
  group_by(Engagement_overall) %>%
  summarise(engagement_count = n())
Here is the output dataframe
I am trying to calculate the average age, % of females, % per province (Ontario, British-Columbia, and Newfoundland-Labrador), median personal income, and baseline mean steps per day within each level of Engagement_overall (and likewise for the other variables, Baseline and Experience). All of the aforementioned variables, with the exception of baseline mean steps per day, are included in the main data frame Users:
Main data frame Users
How could I write code to include these metrics (average age, % of females, % per province, median personal income, and baseline mean steps per day) within each level of Engagement_overall? Baseline and Experience also have 4 levels. Thanks for any help you can provide.
One way to do this is to make the tables separately then join all of them. For an example, I randomly generated some data:
library(tidyverse)
df <- tibble(
  user_id = 1:100,
  gender = sample(c('Male', 'Female'), 100, replace = T),
  age = sample(10:80, 100, replace = T),
  income_level = sample(c('10k-20k', '20k-50k', '50k-70k', '70k+'), 100, replace = T),
  province = sample(c('ON', 'BC', 'NL'), 100, replace = T),
  baseline = sample(c('low', 'medium', 'high'), 100, replace = T),
  experience = sample(1:3, 100, replace = T),
  engagement_overall = sample(1:3, 100, replace = T)
)
We can find the percentage of females by finding the average of a binary variable.
fm <- df %>%
  mutate(female = if_else(gender == 'Female', 1, 0)) %>% # make a column that has value 1 if female, otherwise 0
  group_by(engagement_overall, province) %>%
  summarise(perc_female = mean(female))
Make the province table in the same way
prv <- df %>%
  group_by(engagement_overall, province) %>%
  summarise(prov_count = n()) %>%
  group_by(engagement_overall) %>%
  mutate(total = sum(prov_count),
         prov_count_freq = prov_count / total) %>%
  select(engagement_overall, province, prov_count_freq)
Then you can combine them all using inner_join:
summary_df <- df %>%
  group_by(engagement_overall, province) %>%
  summarise(avg_age = mean(age)) %>%
  ungroup() %>%
  inner_join(fm) %>%
  inner_join(prv)
As for baseline and median income, you can't do statistics on qualitative data, but you could find the mode, which you can do in a similar way.
df %>%
  group_by(engagement_overall, province, baseline) %>%
  summarise(count = n()) %>%
  filter(count == max(count))
If you wanted to find median income, you could make the column numeric by replacing each category with a number, for example replacing '20-40k' with 30. Maybe check out the case_when function from dplyr to do this.
?case_when
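For instance, a rough sketch using the income bands from the sample data above; the midpoint values (15, 35, 60) and the 80 used for the open-ended band are arbitrary assumptions:
df %>%
  mutate(income_mid = case_when(income_level == '10k-20k' ~ 15,
                                income_level == '20k-50k' ~ 35,
                                income_level == '50k-70k' ~ 60,
                                income_level == '70k+'    ~ 80)) %>%  # 80 is an arbitrary stand-in
  group_by(engagement_overall) %>%
  summarise(median_income = median(income_mid))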

Conditionally calculating average time between events by group in R

I am working with a call log data set from a telephone hotline service. There are three call outcomes: Answered, Abandoned & Engaged. I am trying to find out the average time taken by each caller to contact the hotline again after they abandoned the previous call. The time difference can be expressed in seconds, minutes, hours or days, and I would like to get all four if possible.
Here is some mock data with the variables I am working with:
library(wakefield) # for generating the CallOutcome variable
library(dplyr)
library(stringi)
library(Pareto)
library(uuid)
n_users<-1300
n_rows <- 365000
set.seed(1)
#data<-data.frame()
Date<-seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by = "1 day")
Date<-sample(rep(Date,each=1000),replace = T)
u <- runif(length(Date), 0, 60*60*12) # "noise" to add or subtract from some timepoint
CallDateTime<-as.POSIXlt(u, origin = paste0(Date,"00:00:00"))
CallDateTime
CallOutcome<-r_sample_factor(x = c("Answered", "Abandoned", "Engaged"), n=length(Date))
CallOutcome
data<-data.frame(Date,CallDateTime,CallOutcome)
relative_probs <- rPareto(n = n_users, t = 1, alpha = 0.3, truncation = 500)
unique_ids <- UUIDgenerate(n = n_users)
data$CallerId <- sample(unique_ids, size = n_rows, prob = relative_probs, replace = TRUE)
data<-data%>%arrange(CallDateTime)
head(data)
So to reiterate, if a caller abandons their call (represented by "Abandoned" in the CallOutcome column), I would like to know the average time taken for the caller to make another call to the service, in the four time units I have mentioned. Any pointers on how I can achieve this would be great :)
Keep rows in the data where the current row is "Abandoned" and the next row is not "Abandoned", for each ID. Then find the difference in time between each such pair of rows to get the time taken for the caller to make another call to the service after abandoning one, and take the average of those durations to get the average time.
library(dplyr)
data %>%
  # Test the answer on a smaller subset
  # slice(1:1000) %>%
  arrange(CallerId, CallDateTime) %>%
  group_by(CallerId) %>%
  filter(CallOutcome == 'Abandoned' & dplyr::lead(CallOutcome) != 'Abandoned' |
           CallOutcome != 'Abandoned' & dplyr::lag(CallOutcome) == 'Abandoned') %>%
  mutate(group = rep(row_number(), each = 2, length.out = n())) %>%
  group_by(group, .add = TRUE) %>%
  summarise(avg_sec = difftime(CallDateTime[2], CallDateTime[1], units = 'secs')) %>%
  mutate(avg_sec = as.numeric(mean(avg_sec)),
         avg_min = avg_sec / 60,
         avg_hour = avg_min / 60,
         avg_day = avg_hour / 24) -> result
result
First, I would create the lead variable (basically, calculate what the "next" value is, by group). Then it's just as easy as using whatever unit you want for difftime. A density plot can help you analyze these differences, as shown below.
data <- data %>%
  group_by(CallerId) %>%
  mutate(CallDateTime_Next = lead(CallDateTime)) %>%
  ungroup() %>%
  mutate(
    diff_days = difftime(CallDateTime_Next, CallDateTime, units = 'days'),
    diff_hours = difftime(CallDateTime_Next, CallDateTime, units = 'hours'),
    diff_mins = difftime(CallDateTime_Next, CallDateTime, units = 'mins'),
    diff_secs = difftime(CallDateTime_Next, CallDateTime, units = 'secs')
  )
data %>%
  filter(CallOutcome == 'Abandoned') %>%
  ggplot() +
  geom_density(aes(x = diff_days))

Can I omit search results from a dataset in r?

My first work in databases was in FileMaker Pro. One of the features I really liked was the ability to do a complex search, and then with one call, omit those results and return anything from the original dataset that wasn't returned in the search. Is there a way to do this in R without having to flip all the logic in a search?
Something like:
everything_except <- df %>%
  filter(x == "something complex") %>%
  omit()
My initial thought was looking into using a join to keep non-matching values, but thought I would see if there's a different way.
Update with example:
I'm a little hesitant to add an example because I don't want to solve just this specific problem; rather, I want to understand whether there is an underlying method that works for multiple cases.
set.seed(123)
event_df <- tibble(time_sec = c(1:120)) %>%
  sample_n(100) %>%
  mutate(period = sample(c(1, 2, 3),
                         size = 100,
                         replace = TRUE),
         event = sample(c("A", "B"),
                        size = 100,
                        replace = TRUE,
                        prob = c(0.1, 0.9))) %>%
  select(period, time_sec, event) %>%
  arrange(period, time_sec)
filter_within_timeframe <- function(.data, condition, time, lead_time = 0, lag_time = 0) {
  condition <- enquo(condition)
  time <- enquo(time)
  filtered <- .data %>%
    slice(., 1:max(which(!!condition))) %>%
    group_by(., grp = lag(cumsum(!!condition), default = 0)) %>%
    filter(., (last(!!time) - !!time) <= lead_time &
             (last(!!time) - !!time) >= lag_time)
  return(filtered)
}
# this returns 23 rows of data. I would like to return everything except this data
event_df %>% filter_within_timeframe(event == "A", time_sec, 10, 0)
# final output should be 77 rows starting with...
# ~period, ~time_sec, ~event,
# 1,3,"B",
# 1,4,"B",
# 1,5,"B",

Duplicated for specific column after grouping - Speed Issue

I have working code below, which does what I am after, and does it fine for a test subset of about 1,000 records. However, in the actual dataset I have about half a million rows, and there the code suddenly takes over five minutes. Could anyone tell me why, or how to improve the code?
The end result I need is to keep only the first value of duplicated IDs, but this should reset for each year (i.e. duplicate values are fine if they are in different years, but not in the same year).
Test %>%
  group_by(year, id) %>%
  mutate(is_duplicate = duplicated(id)) %>%
  mutate(oppervlakt = ifelse(is_duplicate == FALSE, oppervlakt, 0)) %>%
  select(-is_duplicate)
I think you could remove id from the grouping and you should get the same results: duplicated(id) gives the same flags whether it is evaluated within each year or within each (year, id) pair, and creating a separate group for every (year, id) combination is what slows things down. See this example:
library(dplyr)
# some sample data:
n_rows <- 1E6
df <- data.frame(year = sample(x = c(2000:2018), size = n_rows, replace = TRUE),
                 id = sample(x = seq_len(1000), size = n_rows, replace = TRUE),
                 oppervlakt = rnorm(n = n_rows))
# Roughly 1 second:
system.time(df_slow <- df %>% group_by(year, id) %>% mutate(oppervlakt = ifelse(duplicated(id), 0, oppervlakt)))
# Roughly .1 second:
system.time(df_fast <- df %>% group_by(year) %>% mutate(oppervlakt = ifelse(duplicated(id), 0, oppervlakt)))
all.equal(df_slow, df_fast)
[1] TRUE
