Aggregation and mean calculation with dplyr - r

I have a chunk of code that aggregates the timestamps of a large dataset (see below). Each timestamp represents a tweet. The code aggregates the tweets per week, and it works fine. I also have a column with the sentiment value of each tweet, and I would like to know whether it is possible to calculate the mean sentiment of the tweets per week. Ideally I would end up with one dataset containing the number of tweets per week and the mean sentiment of those tweets. Please let me know if you've got some hints :)
Kind regards,
Daniel
weekly_counts_2 <- df_bw %>%
drop_na(Timestamp) %>%
mutate(weekly_cases = floor_date(
Timestamp,
unit = "week")) %>%
count(weekly_cases) %>%
tidyr::complete(
weekly_cases = seq.Date(
from = min(weekly_cases),
to = max(weekly_cases),
by = "week"),
fill = list(n = 0))

It is difficult to verify the answer since no data has been shared, but based on the description provided, here is a solution that you can try.
library(dplyr)
library(tidyr)
library(lubridate)
weekly_counts_2 <- df_bw %>%
drop_na(Timestamp) %>%
mutate(weekly_cases = floor_date(Timestamp,unit = "week")) %>%
group_by(weekly_cases) %>%
summarise(mean_sentiment = mean(sentiment_value, na.rm = TRUE),
count = n()) %>%
# fill must name the column created in summarise() (count, not n)
complete(weekly_cases = seq.Date(min(weekly_cases),
max(weekly_cases),by = "week"), fill = list(count = 0))
I have assumed the column with the sentiment value is called sentiment_value; change it to match your data.
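To see what the result looks like, here is a minimal sketch on made-up data. The toy data frame df_toy, its values, and the assumption that Timestamp is a Date column are all hypothetical; only the pipeline mirrors the answer above.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Hypothetical toy data: three tweets, two of them in the same week.
df_toy <- tibble(
  Timestamp = as.Date(c("2021-01-04", "2021-01-05", "2021-01-20")),
  sentiment_value = c(0.5, -0.1, 0.3)
)

weekly_toy <- df_toy %>%
  drop_na(Timestamp) %>%
  mutate(weekly_cases = floor_date(Timestamp, unit = "week")) %>%
  group_by(weekly_cases) %>%
  summarise(mean_sentiment = mean(sentiment_value, na.rm = TRUE),
            count = n()) %>%
  # Weeks without tweets get count = 0; their mean_sentiment stays NA.
  complete(weekly_cases = seq.Date(min(weekly_cases),
                                   max(weekly_cases), by = "week"),
           fill = list(count = 0))

weekly_toy
```

The week without tweets appears as a row with a count of 0 and a missing mean, which is usually what you want for plotting a continuous weekly series.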

Related

Pivot_longer on a file that updates daily

Trying to pivot a data frame of Covid-19 deaths imported each day from the web (Johns Hopkins data). The current file is 414 columns wide, growing by one column per day. Pivot_longer works when I specify the width by column index but triggers an error when I try last_col().
For example, this works:
CountyDeathsC <- CountyDeathsB %>%
pivot_longer(cols = c(4:414), names_to="Date", values_to="Cumulative Deaths") %>%
group_by(FIPS, Population, Combined_Key) %>%
mutate(Date = mdy(Date)) %>%
mutate(DeathsToday = `Cumulative Deaths` - dplyr::lag(`Cumulative Deaths`,
n = 1, default = 0),
Deaths7DayAvg = round(zoo::rollapplyr(DeathsToday, 7, mean, na.rm=TRUE, fill=NA))) %>%
mutate(CumDeathsPer100k = `Cumulative Deaths` / (Population / 100000))
This code (excerpt) does not:
pivot_longer(cols = c(4:last_col()), names_to="Date", values_to="Cumulative Deaths")
I get an error saying the term "last_col()" is not recognized. So it looks like I have to go in each day and manually insert the index for the last column. Or is there a better answer?
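One possible workaround, sketched under assumptions: instead of naming the last column, exclude the identifier columns so that every remaining column is pivoted, however many there are. The toy table deaths_wide below is hypothetical, and it assumes the three leading columns are FIPS, Population and Combined_Key (the names used in group_by() above).

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Hypothetical stand-in for CountyDeathsB: three id columns, then one column per day.
deaths_wide <- tibble(
  FIPS = c(1, 2),
  Population = c(1000, 2000),
  Combined_Key = c("County A", "County B"),
  `1/22/20` = c(0, 0),
  `1/23/20` = c(1, 2)
)

deaths_long <- deaths_wide %>%
  # Pivot everything except the id columns; new daily columns are picked up automatically.
  pivot_longer(cols = -c(FIPS, Population, Combined_Key),
               names_to = "Date", values_to = "Cumulative Deaths") %>%
  mutate(Date = mdy(Date))

deaths_long
```

Because the selection is defined by what to leave out, the code does not need to change when the file grows by a column.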

Understanding dplyr piping and summarizing function

I'm looking for some help understanding piping and summarizing functions using dplyr. I feel like my coding is a bit verbose and could be simplified. There are a couple of questions in here because I know I'm missing some concepts, but I'm not quite sure where that lack of knowledge is. I've included my full code at the bottom. Thanks in advance, as this is a somewhat larger ask.
1a. From the example data below and using dplyr, is there a way to calculate the games(dates) per team without using an intermediate table?
1b. I've included my original way to calculate n_games which didn't work. Why?
set.seed(123)
shot_df_ex <- tibble(Team_Name = sample(LETTERS[1:5],250, replace = TRUE),
Date = sample(as.Date(c("2019-08-01",
"2019-09-01",
"2018-08-01",
"2018-09-01",
"2017-08-01",
"2017-09-01")),
size = 250, replace = TRUE),
Type = sample(c("shot","goal"), size = 250,
replace = TRUE, prob = c(0.9,0.1))
)
# count shots per team per game(date)
n_shots_per_game <- shot_df_ex %>%
count(Team_Name,Date)
n_shots_per_game
# count games (dates) per team [ISSUES!!!]
# is there a way to do this piping from the shot_df_ex tibble instead of
# using an intermediate tibble?
# count number of games using the tibble created above [DOES NOT WORK--WHY?]
n_games <- n_shots_per_game %>%
count(Team_Name)
n_games #what is this counting? It should be 6 for each.
# this works, but isn't count() just a quicker way to run
# group_by() %>% summarise()?
n_games <- n_shots_per_game %>%
group_by(Team_Name) %>%
summarise(N_Games=n())
n_games
Below is my process of creating a summary table. I understand that piping is meant to cut out the creation of some intermediate variables/tables. Where could I combine steps below to create the final table with a minimum number of intermediate steps?
# load libraries -----------------------------------------------
library(tidyverse)
# build sample shot data ---------------------------------------
set.seed(123)
shot_df_ex <- tibble(Team_Name = sample(LETTERS[1:5],250, replace = TRUE),
Date = sample(as.Date(c("2019-08-01",
"2019-09-01",
"2018-08-01",
"2018-09-01",
"2017-08-01",
"2017-09-01")),
size = 250, replace = TRUE),
Type = sample(c("shot","goal"), size = 250,
replace = TRUE, prob = c(0.9,0.1))
)
# calculate data ----------------------------------------------
# since every row is a shot, the following function counts shots for ea. team
n_shots <- shot_df_ex %>%
count(Team_Name) %>%
rename(N_Shots = n)
n_shots
# do the same for goals for each team
n_goals <- shot_df_ex %>%
filter(Type == "goal") %>%
count(Team_Name,sort = T) %>%
rename(N_Goals = n) %>%
arrange(Team_Name)
n_goals
# count shots per team per game(date)
n_shots_per_game <- shot_df_ex %>%
count(Team_Name,Date)
n_shots_per_game
# count games (dates) per team [ISSUES!!!]
# is there a way to do this piping from the shot_df_ex tibble instead of
# using an intermediate tibble?
# count number of games using the tibble created above [DOES NOT WORK]
n_games <- n_shots_per_game %>%
count(Team_Name)
n_games #what is this counting? It should be 6 for each.
# this works, but isn't count() just a quicker way to run
# group_by() %>% summarise()?
n_games <- n_shots_per_game %>%
group_by(Team_Name) %>%
summarise(N_Games=n())
n_games
# combine data ------------------------------------------------
# combine columns and add average shots per game
shot_table_ex <- n_games %>%
left_join(n_shots) %>%
left_join(n_goals)
# final table with final average calculations
shot_table_ex <- shot_table_ex %>%
mutate(Shots_per_Game = round(N_Shots / N_Games, 1),
Goals_per_Game = round(N_Goals / N_Games, 1)) %>%
arrange(Team_Name)
shot_table_ex
For 1a, you can just pipe straight from the tibble() function to count(), i.e.
tibble(Team_Name = sample(LETTERS[1:5],250, replace = TRUE),
Date = sample(as.Date(c("2019-08-01",
"2019-09-01",
"2018-08-01",
"2018-09-01",
"2017-08-01",
"2017-09-01")),
size = 250, replace = TRUE),
Type = sample(c("shot","goal"), size = 250,
replace = TRUE, prob = c(0.9,0.1))) %>%
count(Team_Name,Date)
In 1b, count() is using your column n (i.e. the number of shots) as a weighting variable, so it sums the total number of shots per team rather than counting rows. It prints a message telling you this:
Using `n` as weighting variable
ℹ Quiet this message with `wt = n` or count rows with `wt = 1`
Using count(Team_Name, wt=n()) will give the behaviour you want.
Edit: part 2
shot_table_ex <- tibble(Team_Name = sample(LETTERS[1:5],250, replace = TRUE),
Date = sample(as.Date(c("2019-08-01",
"2019-09-01",
"2018-08-01",
"2018-09-01",
"2017-08-01",
"2017-09-01")),
size = 250, replace = TRUE),
Type = sample(c("shot","goal"), size = 250,
replace = TRUE, prob = c(0.9,0.1))) %>%
group_by(Team_Name) %>%
summarise(n_shots = n(),
n_goals = sum(Type == "goal"),
n_games = n_distinct(Date)) %>%
mutate(Shots_per_Game = round(n_shots / n_games, 1),
Goals_per_Game = round(n_goals / n_games, 1))
1a. From the example data below and using dplyr, is there a way to calculate the games(dates) per team without using an intermediate table?
This is how I would do it:
shot_df_ex %>%
distinct(Team_Name, Date) %>% #Keeps only the cols given and one of each combo
count(Team_Name)
You can also use unique:
shot_df_ex %>%
group_by(Team_Name) %>%
summarize(N_Games = length(unique(Date)))
1b. I've included my original way to calculate n_games which didn't work. Why?
Your code is working for me. Did you perhaps save over the intermediate table? It's counting the expected 6 per team.
Below is my process of creating a summary table. I understand that piping is meant to cut out the creation of some intermediate variables/tables. Where could I combine steps below to create the final table with a minimum number of intermediate steps?
shot_df_ex %>%
group_by(Team_Name) %>%
summarize(
N_Games = length(unique(Date)),
N_Shots = sum(Type == "shot"),
N_Goals = sum(Type == "goal")
) %>%
mutate(Shots_per_Game = round(N_Shots / N_Games, 1),
Goals_per_Game = round(N_Goals / N_Games, 1))
You can compute several summaries in a single summarize() call as long as you don't need to change your grouping. The sum() calls take advantage of R treating TRUE as 1 and FALSE as 0, and length() gives the length of the vector produced by unique().
this (count) works, but isn't count() just a quicker way to run group_by() %>% summarise()?
count is just a combination of group_by(col) %>% tally() and tally is essentially summarize(x=n()) so yes. :)
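As a quick check of that equivalence, here is a minimal sketch on a small hand-made tibble. The object names shots, via_count, via_summarise and games are made up for illustration; the data is not the sampled shot_df_ex.

```r
library(dplyr)

# Hypothetical toy data: team A has shots on two dates, team B on one.
shots <- tibble(
  Team_Name = c("A", "A", "A", "B", "B"),
  Date = as.Date(c("2019-08-01", "2019-08-01", "2019-09-01",
                   "2019-08-01", "2019-08-01"))
)

# count() and group_by() %>% summarise(n()) give the same row counts.
via_count <- shots %>% count(Team_Name, name = "N")
via_summarise <- shots %>% group_by(Team_Name) %>% summarise(N = n())

# Games per team straight from the raw rows, with no intermediate table.
games <- shots %>%
  group_by(Team_Name) %>%
  summarise(N_Games = n_distinct(Date))
```

n_distinct(Date) is shorthand for length(unique(Date)), so either spelling answers 1a without a helper table.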

Question with using time series in R for forecasting via example

I'm working through this example. However, when I begin investigating the tk_ts output, I don't think it is taking the start/end dates I'm entering correctly, and I'm unsure what the proper input is if I want it to start at 12-31-2019 and end at 7-17-2020:
daily_cases2 <- as_tibble(countrydatescases) %>%
mutate(Date = as_date(date)) %>%
group_by(country, Date) %>%
summarise(total_cases = sum(total_cases))
daily_cases2$total_cases <- as.double(daily_cases2$total_cases)
# Nest
daily_cases2_nest <- daily_cases2 %>%
group_by(country) %>%
tidyr::nest()
# TS
daily_cases2_ts <- daily_cases2_nest %>%
mutate(data.ts = purrr::map(.x = data,
.f = tk_ts,
select = -Date,
start = 2019-12-31,
freq = 1))
Here is what I get when I examine it closely:
When I go through the example steps with these parameters the issue is also then seen in the subsequent graph:
I've tried varying the frequency and start parameters and it's just not making sense. Any suggestions?
You've given the start and end dates, but you haven't said what frequency you want. Given that you want the series to start at the end of 2019 and end in the middle of July, 2020, I'm guessing you want a daily time series. In that case, the code should be:
daily_cases2_ts <- daily_cases2_nest %>%
mutate(data.ts = purrr::map(.x = data,
.f = tk_ts,
select = -Date,
start = c(2019, 365), # day 365 of year 2019
freq = 365)) # daily series
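Part of the confusion is that an unquoted 2019-12-31 is ordinary subtraction in R, not a date. The sketch below uses base R's ts() rather than tk_ts(), with placeholder values, simply to show how a start of c(year, day-of-year) behaves; the names start_as_written, n_days and x are made up for illustration.

```r
# Unquoted, 2019-12-31 is plain arithmetic: 2019 minus 12 minus 31.
start_as_written <- 2019-12-31   # evaluates to 1976

# Number of daily observations from 2019-12-31 to 2020-07-17, inclusive.
n_days <- as.integer(as.Date("2020-07-17") - as.Date("2019-12-31")) + 1

# Placeholder values; start = c(year, period) with 365 periods per year.
x <- ts(seq_len(n_days), start = c(2019, 365), frequency = 365)
start(x)
```

Note that 2020 is a leap year, so a frequency of 365 is an approximation; for series where exact calendar dates matter, keeping the Date column alongside the values may be the safer choice.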

R - Difference between two rows conditional on variables

Lately, for everything I manage on my own, there seems to be a point I need to clarify on this board, so thanks in advance for your help.
The issue is as follows:
library(data.table)
set.seed(1)
# := only works on a data.table, and every column needs the same length (10 rows)
Data <- data.table(
Month = c(1,1,2,2,3,3,4,3,4,5),
Code_ID_Buy = c("100D","100D","100D","100D","100D","100D","100D","102D","102D","102D"),
Code_ID_Sell = c("98C","99C","98C","99C","98C","96V","25A","25A","25A","25A"),
Contract_Size = c(100,20,120,300,120,30,25,60,80,90)
)
Data[, totalcont := sum(Contract_Size), by=c("Code_ID_Buy","Month")]
View(Data)
I would like to calculate the difference in total contract size from one period to the next. Does anybody know whether this is possible? I have been working on this for weeks without finding a solution.
Kind regards,
This should solve it
library(tidyverse)
library(data.table)
set.seed(1)
Data <- data.frame(
Month = c(1,1,2,2,3,3,4,3,4,5),
Code_ID_Buy = c("100D","100D","100D","100D","100D","100D","100D","102D","102D","102D"),
Code_ID_Sell = c("98C","99C","98C","99C","98C","96V","25A","25A","25A","25A"),
Contract_Size = c(100,20,120,300,120,30,25,60,80,90)
)
Data %>%
group_by(Month) %>%
summarise(total_contract = sum(Contract_Size)) %>%
mutate(dif_contract = total_contract -lag(total_contract))
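If you need to compute the difference within another grouping variable, use:
Data %>%
group_by(Code_ID_Buy,Month) %>%
summarise(total_contract = sum(Contract_Size)) %>%
# after summarise() the data is still grouped by Code_ID_Buy, so lag() stays within each buyer
mutate(dif_contract = total_contract -lag(total_contract))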
If you need to keep the original row structure, you can join the result back with left_join:
result <- Data %>%
group_by(Code_ID_Buy,Month) %>%
summarise(total_contract = sum(Contract_Size)) %>%
mutate(dif_contract = total_contract -lag(total_contract))
# join on the grouping keys explicitly
Data %>%
left_join(result, by = c("Code_ID_Buy", "Month"))
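To make the grouped version concrete, here is a small self-contained sketch using the same ten rows. The explicit .groups = "drop_last" argument assumes dplyr 1.0 or later and only spells out the default; the names per_buyer and dif_contract are chosen for illustration.

```r
library(dplyr)

# Same Month, buyer and contract values as in the example data above.
Data <- tibble(
  Month = c(1, 1, 2, 2, 3, 3, 4, 3, 4, 5),
  Code_ID_Buy = c(rep("100D", 7), rep("102D", 3)),
  Contract_Size = c(100, 20, 120, 300, 120, 30, 25, 60, 80, 90)
)

per_buyer <- Data %>%
  group_by(Code_ID_Buy, Month) %>%
  summarise(total_contract = sum(Contract_Size), .groups = "drop_last") %>%
  # Still grouped by Code_ID_Buy here, so lag() never crosses buyers.
  mutate(dif_contract = total_contract - lag(total_contract))

per_buyer
```

The first month of each buyer has no previous period, so its difference is NA rather than a comparison against another buyer's last month.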

Count of crashes and injuries?

I have a dataset from dot.gov website that I have to analyze as part of our school project. It contains a lot of information, but I am just focusing on crashes and injuries. How do I count the number of crashes or injuries from the year 2007-2014 for example?
Do I have to subset my data per year or is there a more efficient way to do it? Thank you!
Below is a sample of my dataset:
Without a reproducible example of your dataset, it is difficult to be sure this will work, but using the dplyr and lubridate packages you can try the following (assuming that your dataset is called df):
library(dplyr)
library(lubridate)
df %>% mutate(YEARTXT = ymd(YEARTXT)) %>%
mutate(Year = year(YEARTXT)) %>%
filter(Year %in% 2007:2014) %>%
summarise(INJURED = sum(INJURED, na.rm = FALSE),
CRASH = sum(CRASH == "Y"))
To get the counts of crashes and injuries per year, add group_by() to the sequence above:
df %>% mutate(YEARTXT = ymd(YEARTXT)) %>%
mutate(Year = year(YEARTXT)) %>%
group_by(Year) %>%
filter(Year %in% 2007:2014) %>%
summarise(INJURED = sum(INJURED, na.rm = FALSE),
CRASH = sum(CRASH == "Y"))
If this is not working, please provide a reproducible example of your dataset: How to make a great R reproducible example
