Pivot_longer on a file that updates daily - r

I'm trying to pivot a data frame of Covid-19 deaths imported each day from the web (Johns Hopkins data). The current file is 414 columns wide and grows by one column per day. pivot_longer() works when I specify the column range by index, but it triggers an error when I try last_col().
For example, this works:
CountyDeathsC <- CountyDeathsB %>%
  pivot_longer(cols = c(4:414), names_to = "Date", values_to = "Cumulative Deaths") %>%
  group_by(FIPS, Population, Combined_Key) %>%
  mutate(Date = mdy(Date)) %>%
  mutate(DeathsToday = `Cumulative Deaths` - dplyr::lag(`Cumulative Deaths`, n = 1, default = 0),
         Deaths7DayAvg = round(zoo::rollapplyr(DeathsToday, 7, mean, na.rm = TRUE, fill = NA))) %>%
  mutate(CumDeathsPer100k = `Cumulative Deaths` / (Population / 100000))
This code (excerpt) does not:
pivot_longer(cols = c(4:last_col()), names_to="Date", values_to="Cumulative Deaths")
I get an error saying that "last_col()" is not recognized. So it looks as though I have to go in each day and manually update the index of the last column. Or is there a better answer?
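One possible approach, as a sketch (untested against this data): last_col() comes from tidyselect (re-exported by dplyr and tidyr), so with those packages loaded it should resolve inside pivot_longer(cols = ...); alternatively, the identifier columns can be excluded by name so no index ever needs updating. This assumes the first three columns are FIPS, Population and Combined_Key, as the working call above suggests.
library(dplyr)
library(tidyr)

# Sketch 1: pivot from column 4 to whatever the last column currently is.
CountyDeathsC <- CountyDeathsB %>%
  pivot_longer(cols = 4:last_col(),   # last_col() is resolved by tidyselect inside cols
               names_to = "Date", values_to = "Cumulative Deaths")

# Sketch 2: name the identifier columns and pivot everything else.
CountyDeathsC <- CountyDeathsB %>%
  pivot_longer(cols = -c(FIPS, Population, Combined_Key),
               names_to = "Date", values_to = "Cumulative Deaths")
Either way, the selection adapts as new date columns are appended each day.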

Related

Recurring investment using R and PerformanceAnalytics

I am using R and PerformanceAnalytics to calculate the portfolio returns of a strategy.
Specifically, I want to start with $3000 invested equally across the available assets, and then add a recurring $1000, split equally across the available assets, in January and June. The previous investments should not be reallocated; only the new $1000 is split equally at each rebalance.
The code below calculates the growth in returns when investing $3000 initially and rebalancing every six months, but it does not allow recurring investments split across assets at rebalancing without reallocating what has already been allocated to the stocks.
The following does not achieve this, but it gives a starting point for someone able to assist:
library(tidyverse)
library(PerformanceAnalytics)
library(tbl2xts)

data(managers)
df_series <- managers[, 1:3] %>% xts_tbl()

w_xts <- df_series %>%
  filter(format(date, "%b") %in% c("Jan", "Jun")) %>%
  gather(fund, value, -date) %>%
  mutate(value = coalesce(value, 0)) %>%
  mutate(value = ifelse(abs(value) == 0, 0, 1)) %>%
  arrange(date) %>%
  group_by(date) %>%
  mutate(value = value / sum(value)) %>%
  ungroup() %>%
  tbl_xts(cols_to_xts = value, spread_by = fund)

r_xts <- df_series %>% tbl_xts()
r_xts[is.na(r_xts)] <- 0

portfolio_return <- PerformanceAnalytics::Return.portfolio(R = r_xts, weights = w_xts,
                                                           value = 3000, verbose = TRUE)
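As a minimal sketch of one possible direction (a manual loop, not a Return.portfolio() feature): compound the per-asset holdings month by month and add the new cash in January and June without touching the existing sleeves. This assumes r_xts from above, i.e. simple monthly returns with NAs set to 0, and is illustrative only.
# Sketch: manual compounding with recurring contributions, no reallocation.
n_assets <- ncol(r_xts)
holdings <- rep(3000 / n_assets, n_assets)              # initial $3000 split equally
value_path <- numeric(nrow(r_xts))

for (i in seq_len(nrow(r_xts))) {
  holdings <- holdings * (1 + as.numeric(r_xts[i, ]))   # grow each sleeve by its own return
  if (format(zoo::index(r_xts)[i], "%b") %in% c("Jan", "Jun")) {
    holdings <- holdings + 1000 / n_assets              # add new cash only; existing sleeves untouched
  }
  value_path[i] <- sum(holdings)
}

portfolio_value <- xts::xts(value_path, order.by = zoo::index(r_xts))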

Aggregation and mean calculation with dplyr

I have a chunk of code that aggregates the timestamps of a large dataset (see below). Each timestamp represents a tweet. The code aggregates the tweets per week, and it works fine. Now, I also have a column with the sentiment value of each tweet, and I would like to know whether it is possible to calculate the mean sentiment of the tweets per week. Ideally I would end up with one dataset containing the number of tweets per week and the mean sentiment of those aggregated tweets. Please let me know if you've got some hints :)
Kind regards,
Daniel
weekly_counts_2 <- df_bw %>%
  drop_na(Timestamp) %>%
  mutate(weekly_cases = floor_date(Timestamp, unit = "week")) %>%
  count(weekly_cases) %>%
  tidyr::complete(weekly_cases = seq.Date(from = min(weekly_cases),
                                          to = max(weekly_cases),
                                          by = "week"),
                  fill = list(n = 0))
It is difficult to verify the answer since no data has been shared, but based on the description provided, here is a solution you can try.
library(dplyr)
library(tidyr)
library(lubridate)
weekly_counts_2 <- df_bw %>%
  drop_na(Timestamp) %>%
  mutate(weekly_cases = floor_date(Timestamp, unit = "week")) %>%
  group_by(weekly_cases) %>%
  summarise(mean_sentiment = mean(sentiment_value, na.rm = TRUE),
            count = n()) %>%
  complete(weekly_cases = seq.Date(min(weekly_cases),
                                   max(weekly_cases), by = "week"),
           fill = list(count = 0))   # the count column is named `count` here, so fill that
I have assumed the column with the sentiment value is called sentiment_value; change it to match your data.

Conditionally calculating average time between events by group in R

I am working with a call log data set from a telephone hotline service. There are three call outcomes: Answered, Abandoned and Engaged. I am trying to find out the average time taken by each caller to contact the hotline again if they abandoned the previous call. The time difference could be expressed in seconds, minutes, hours or days, and I would like to get all four if possible.
Here is some mock data with the variables I am working with:
library(wakefield) # for generating the Status variable
library(dplyr)
library(stringi)
library(Pareto)
library(uuid)

n_users <- 1300
n_rows <- 365000
set.seed(1)

Date <- seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by = "1 day")
Date <- sample(rep(Date, each = 1000), replace = TRUE)
u <- runif(length(Date), 0, 60 * 60 * 12) # "noise" to add or subtract from some timepoint
CallDateTime <- as.POSIXlt(u, origin = paste(Date, "00:00:00"))
CallOutcome <- r_sample_factor(x = c("Answered", "Abandoned", "Engaged"), n = length(Date))
data <- data.frame(Date, CallDateTime, CallOutcome)
relative_probs <- rPareto(n = n_users, t = 1, alpha = 0.3, truncation = 500)
unique_ids <- UUIDgenerate(n = n_users)
data$CallerId <- sample(unique_ids, size = n_rows, prob = relative_probs, replace = TRUE)
data <- data %>% arrange(CallDateTime)
head(data)
So to reiterate, if a caller abandons their call (represented by "Abandoned" in the CallOutcome column), I would like to know the average time taken for the caller to make another call to the service, in the four time units I have mentioned. Any pointers on how I can achieve this would be great :)
For each caller, keep the rows where the current call is "Abandoned" and the next one is not, together with that following call. Take the time difference between each such pair of rows to get the time the caller needed to contact the service again after abandoning, then average those durations to get the average time.
library(dplyr)
data %>%
  # Test the answer on a smaller subset first if needed:
  # slice(1:1000) %>%
  arrange(CallerId, CallDateTime) %>%
  group_by(CallerId) %>%
  filter(CallOutcome == 'Abandoned' & dplyr::lead(CallOutcome) != 'Abandoned' |
         CallOutcome != 'Abandoned' & dplyr::lag(CallOutcome) == 'Abandoned') %>%
  mutate(group = rep(row_number(), each = 2, length.out = n())) %>%
  group_by(group, .add = TRUE) %>%
  summarise(avg_sec = difftime(CallDateTime[2], CallDateTime[1], units = 'secs')) %>%
  mutate(avg_sec = as.numeric(mean(avg_sec)),
         avg_min = avg_sec / 60,
         avg_hour = avg_min / 60,
         avg_day = avg_hour / 24) -> result

result
First, I would create the lead variable (basically calculate the "next" value by group). Then it's just a matter of using whatever unit you want for difftime. A density plot can help you analyze these differences, as shown below.
library(ggplot2) # for the density plot below

data <- data %>%
  group_by(CallerId) %>%
  mutate(CallDateTime_Next = lead(CallDateTime)) %>%
  ungroup() %>%
  mutate(diff_days = difftime(CallDateTime_Next, CallDateTime, units = 'days'),
         diff_hours = difftime(CallDateTime_Next, CallDateTime, units = 'hours'),
         diff_mins = difftime(CallDateTime_Next, CallDateTime, units = 'mins'),
         diff_secs = difftime(CallDateTime_Next, CallDateTime, units = 'secs'))

data %>%
  filter(CallOutcome == 'Abandoned') %>%
  ggplot() +
  geom_density(aes(x = diff_days))

Question with using time series in R for forecasting via example

I'm working through this example. However, when I start investigating the tk_ts output, I don't think it is taking the start/end dates I'm entering correctly, but I am unsure what the proper input is if I want it to start at 12-31-2019 and end at 7-17-2020:
daily_cases2 <- as_tibble(countrydatescases) %>%
  mutate(Date = as_date(date)) %>%
  group_by(country, Date) %>%
  summarise(total_cases = sum(total_cases))

daily_cases2$total_cases <- as.double(daily_cases2$total_cases)

# Nest
daily_cases2_nest <- daily_cases2 %>%
  group_by(country) %>%
  tidyr::nest()

# TS
daily_cases2_ts <- daily_cases2_nest %>%
  mutate(data.ts = purrr::map(.x = data,
                              .f = tk_ts,
                              select = -Date,
                              start = 2019-12-31,
                              freq = 1))
When I examine the resulting ts objects closely, the start and end dates don't match what I entered, and when I go through the example steps with these parameters the issue also carries through to the subsequent graph. I've tried varying the frequency and start parameters and it's just not making sense. Any suggestions?
You've given the start and end dates, but you haven't said what frequency you want. Given that you want the series to start at the end of 2019 and end in the middle of July, 2020, I'm guessing you want a daily time series. In that case, the code should be:
daily_cases2_ts <- daily_cases2_nest %>%
  mutate(data.ts = purrr::map(.x = data,
                              .f = tk_ts,
                              select = -Date,
                              start = c(2019, 365), # day 365 of year 2019
                              freq = 365))          # daily series

Plotting frequency of occurrences based on start/end times in R

I have a "trips" dataset that includes a unique trip id, and a start and end time (the specific hour and minute) of the trips. These trips were all taken on the same day. I am trying to determine the number of cars on the road at any given time and plot it as a line graph using ggplot in R. In other words, a car is "on the road" at any time in between its start and end time.
The most similar example I can find uses the following structure:
yearly_counts <- trips %>%
  count(year, trip_id)

ggplot(data = yearly_counts, mapping = aes(x = year, y = n)) +
  geom_line()
Would the best approach be to modify this structure to have a "minutesByHour_count" variable that has a count for every minute of every hour? This seems inefficient to me, and it still doesn't solve the problem of getting the counts from the start/end times.
Is there any easier way to do this?
Here's an example based on counting each start as an additional car, and each end as a reduction in the count:
library(tidyverse)
df %>%
  gather(type, time, c(start_hour, end_hour)) %>%
  mutate(count_chg = if_else(type == "start_hour", 1, -1)) %>%
  arrange(time) %>%
  mutate(car_count = cumsum(count_chg)) %>%
  ggplot(aes(time, car_count)) +
  geom_step()
Sample data:
df <- data.frame(
  uniqueID = 1:60,
  start_hour = seq(8, 12, length.out = 60),
  dur_hour = 0.05 * 1:60
)
df$end_hour <- df$start_hour + df$dur_hour
df$dur_hour <- NULL