Recurring investment using R and PerformanceAnalytics

I am using R and PerformanceAnalytics to calculate the portfolio returns of a strategy.
Specifically, I want to start with $3000 invested equally across the available assets, and then add a recurring $1000, split equally across the available assets, each January and June. Previous investments should not be reallocated; only the new $1000 is split equally at each rebalance.
The code below calculates the growth of a $3000 initial investment that is rebalanced every six months, but it does not allow for recurring investments split across the assets.
In other words, I want to add the extra $1000 to the stocks at each six-monthly rebalance, without reallocating what has already been allocated to the stocks.
The following does not achieve this, but it gives a starting point for someone able to assist:
library(tidyverse)
library(PerformanceAnalytics)
library(tbl2xts)

data(managers)
df_series <- managers[, 1:3] %>% xts_tbl()

# weights xts: equal weights across the funds that have data, rebalanced in January and June
w_xts <- df_series %>%
  filter(format(date, "%b") %in% c("Jan", "Jun")) %>%
  gather(fund, value, -date) %>%
  mutate(value = coalesce(value, 0)) %>%
  mutate(value = ifelse(abs(value) == 0, 0, 1)) %>%
  arrange(date) %>%
  group_by(date) %>%
  mutate(value = value / sum(value)) %>%
  ungroup() %>%
  tbl_xts(cols_to_xts = value, spread_by = fund)

# returns xts with missing values set to zero
r_xts <- df_series %>% tbl_xts()
r_xts[is.na(r_xts)] <- 0

portfolio_return <- Return.portfolio(R = r_xts, weights = w_xts, value = 3000, verbose = TRUE)
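One way to get the recurring contributions (a sketch of my own, not something Return.portfolio does out of the box) is to simulate the portfolio period by period: grow each holding by its own return, and in January and June add the extra $1000 split equally across the assets without touching the existing allocation. The $3000 start, the $1000 contribution and the three managers columns follow the question; whether the very first January should also receive a contribution is a detail to adjust.
library(PerformanceAnalytics)
library(xts)

data(managers)
r_xts <- managers[, 1:3]
r_xts[is.na(r_xts)] <- 0

n_assets <- ncol(r_xts)
holdings <- rep(3000 / n_assets, n_assets)   # initial $3000 split equally
portfolio_value <- xts(numeric(nrow(r_xts)), order.by = index(r_xts))

for (i in seq_len(nrow(r_xts))) {
  # add the recurring $1000 in January and June, split equally,
  # without reallocating the existing holdings
  if (format(index(r_xts)[i], "%b") %in% c("Jan", "Jun")) {
    holdings <- holdings + 1000 / n_assets
  }
  # grow each holding by its own return for the period
  holdings <- holdings * (1 + as.numeric(r_xts[i, ]))
  portfolio_value[i] <- sum(holdings)
}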

Related

Pivot_longer on a file that updates daily

I am trying to pivot a data frame of Covid-19 deaths imported each day from the web (Johns Hopkins data). The current file is 414 columns wide and grows by one column per day. pivot_longer works when I specify the columns by index, but it triggers an error when I try last_col().
For example, this works:
CountyDeathsC <- CountyDeathsB %>%
  pivot_longer(cols = c(4:414), names_to = "Date", values_to = "Cumulative Deaths") %>%
  group_by(FIPS, Population, Combined_Key) %>%
  mutate(Date = mdy(Date)) %>%
  mutate(DeathsToday = `Cumulative Deaths` - dplyr::lag(`Cumulative Deaths`, n = 1, default = 0),
         Deaths7DayAvg = round(zoo::rollapplyr(DeathsToday, 7, mean, na.rm = TRUE, fill = NA))) %>%
  mutate(CumDeathsPer100k = `Cumulative Deaths` / (Population / 100000))
This code (excerpt) does not:
pivot_longer(cols = c(4:last_col()), names_to = "Date", values_to = "Cumulative Deaths")
I get an error saying the term "last_col()" is not recognized. So it looks like I have to go in each day and manually insert the index for the last column. Or is there a better answer?
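One workaround (a sketch, assuming the first three columns are the identifier columns FIPS, Population and Combined_Key, as in the working code above) is to select the columns to pivot by exclusion instead of naming the last one:
library(dplyr)
library(tidyr)

# toy stand-in for CountyDeathsB: three ID columns plus however many date columns
CountyDeathsB <- tibble(
  FIPS = c("01001", "01003"),
  Population = c(55200, 208107),
  Combined_Key = c("Autauga, Alabama, US", "Baldwin, Alabama, US"),
  `3/1/20` = c(0, 0),
  `3/2/20` = c(0, 1)
)

CountyDeathsC <- CountyDeathsB %>%
  pivot_longer(cols = -(1:3),            # everything after the three ID columns,
               names_to = "Date",        # however many date columns there are
               values_to = "Cumulative Deaths")
This keeps working as the file grows, because every column after the three identifiers is pivoted regardless of how many there are.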

R: Home sales in the last year before each sale

As a follow-up question to a previous one in the same project:
I found that real estate is often measured in inventory time, which is defined as (number of active listings) / (number of homes sold per month, averaged over the last 12 months). The best way I could find to count the number of homes sold in the 12 months before each home sale is with a for-loop.
homesales$yearlysales <- 0
for (i in 1:nrow(homesales)) {
  sdt <- as.Date(homesales$saledate[i])
  x <- homesales %>%
    filter(sdt - saledate >= 0 & sdt - saledate < 365) %>%
    summarise(count = n())
  homesales$yearlysales[i] <- x$count[1]
}
homesales$inventorytime <- homesales$inventory / homesales$yearlysales * 12
homesales$inventorytime[is.na(homesales$saledate)] <- NA
homesales$inventorytime[homesales$yearlysales == 0] <- NA
Obviously (?), the R language has some prejudice against using a for-loop for this type of selection. Is there a better way?
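For what it's worth, the same rolling count can also be written without an explicit for-loop. This is a sketch of my own using base R's sapply; the column names follow the loop above:
saledates <- as.Date(homesales$saledate)

# for each sale, count the sales that happened in the 365 days up to and including it
homesales$yearlysales <- sapply(seq_along(saledates), function(i) {
  sum(saledates[i] - saledates >= 0 & saledates[i] - saledates < 365, na.rm = TRUE)
})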
Appendix 1. data table structure
address, listingdate, saledate
101 Street, 2017/01/01, 2017/06/06
106 Street, 2017/03/01, 2017/08/11
102 Street, 2017/05/04, 2017/06/13
109 Street, 2017/07/04, 2017/11/24
...
Appendix 2. The output I'm looking for is something like this.
The following gives you the number of active listings on any given day:
library(tidyverse)
library(lubridate)

tmp <- tempfile()
download.file("https://raw.githubusercontent.com/robhanssen/glenlake-homesales/master/homesalesdata-source.csv", tmp)

data <- read_csv(tmp) %>%
  select(ends_with("date")) %>%
  mutate(across(everything(), mdy)) %>%
  pivot_longer(cols = everything(), names_to = "activity", values_to = "date",
               names_pattern = "(.*)date")

active <- data %>%
  mutate(active = if_else(activity == "listing", 1, -1)) %>%
  arrange(date) %>%
  mutate(active = cumsum(active)) %>%
  group_by(date) %>%
  filter(row_number() == n()) %>%
  select(-activity)

tibble(date = seq(min(data$date, na.rm = TRUE), max(data$date, na.rm = TRUE), by = "days")) %>%
  left_join(active) %>%
  fill(active)
Basically, we pivot longer and split each row of data into two rows indicating distinct activities: adding a listing or removing a listing. Then the cumulative sum of this gives you the number of active listings.
Note, this assumes that you are not missing any data. Depending on the specification from which the csv was made, you could be missing activity at the start or end. But this is a warning about the csv itself.
Active listings is a fact about an instant in time. Sales is a fact about a time period. You probably want to aggregate sales by month, and then use the number of active listings from the last day of the month, or perhaps the average number of listings over that month.
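A rough sketch of that aggregation, continuing from the data and active objects built above (the monthly granularity and the month-end choice are my assumptions, not part of the code above):
monthly_sales <- data %>%
  filter(activity == "sale", !is.na(date)) %>%
  count(month = floor_date(date, "month"), name = "sales")

month_end_active <- tibble(date = seq(min(data$date, na.rm = TRUE),
                                      max(data$date, na.rm = TRUE),
                                      by = "days")) %>%
  left_join(active, by = "date") %>%
  fill(active) %>%
  group_by(month = floor_date(date, "month")) %>%
  slice_tail(n = 1) %>%                        # last day of each month
  ungroup() %>%
  select(month, active)

monthly_sales %>%
  left_join(month_end_active, by = "month") %>%
  mutate(inventory_time = active / sales)      # months of inventory at that month's sales pace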

Question with using time series in R for forecasting via example

I'm working through this example. However, when I investigate the tk_ts output, I don't think it is taking the start/end dates I'm entering correctly, but I am unsure what the proper input is if I want it to start at 12-31-2019 and end at 7-17-2020:
daily_cases2 <- as_tibble(countrydatescases) %>%
  mutate(Date = as_date(date)) %>%
  group_by(country, Date) %>%
  summarise(total_cases = sum(total_cases))

daily_cases2$total_cases <- as.double(daily_cases2$total_cases)

# Nest
daily_cases2_nest <- daily_cases2 %>%
  group_by(country) %>%
  tidyr::nest()

# TS
daily_cases2_ts <- daily_cases2_nest %>%
  mutate(data.ts = purrr::map(.x = data,
                              .f = tk_ts,
                              select = -Date,
                              start = 2019-12-31,
                              freq = 1))
Here is what I get when I examine it closely:
When I go through the example steps with these parameters the issue is also then seen in the subsequent graph:
I've tried varying the frequency and start parameters and it's just not making sense. Any suggestions?
You've given the start and end dates, but you haven't said what frequency you want. Given that you want the series to start at the end of 2019 and end in the middle of July, 2020, I'm guessing you want a daily time series. In that case, the code should be:
daily_cases2_ts <- daily_cases2_nest %>%
  mutate(data.ts = purrr::map(.x = data,
                              .f = tk_ts,
                              select = -Date,
                              start = c(2019, 365),  # day 365 of year 2019
                              freq = 365))           # daily series
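As a quick check (my own addition, not part of the answer above), you can inspect the attributes tk_ts recorded on one of the nested series:
first_series <- daily_cases2_ts$data.ts[[1]]
stats::start(first_series)      # should report c(2019, 365)
stats::frequency(first_series)  # should report 365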

How to build recommendation model for calling prospects

My goal is to better target prospects at a higher call success rate, based on time of day and prior history.
I have created a "Prodprobability" column showing the probability of a PropertyID answering the phone at that hour over the history of calls. Instead of merely omitting Property ID 233303.13 from any calls, I want to retarget them into hour 13 or hour 16 (the sample data doesn't show it, but the probabilities of pickup at those hours are 100% and 25%, respectively).
So, moving forward, based on hour of day, and history of that prospect picking up the phone or not during that hour, I'd like to re-target every prospect during the hours they're most likely to pick up.
sample data
EDIT: I guess I need a formula to do this: If "S425=0", I want to search for where "A425" has the highest probability in the S column, and return the hour and probability for that "PropertyID". Hopefully that makes sense.
EDIT: sample data returns this:
The question here would be: are you dead set on creating a 'model', or would an automation work for you?
I would suggest ordering the dataframe by the probability of picking up the call every hour (so you can serve the more probable leads first) and then further sorting them by the number of calls on that day.
Something along the lines of:
require(dplyr)

todaysCall <- df %>%
  dplyr::group_by(propertyID) %>%
  dplyr::summarise(noOfCalls = n())

hourlyCalls <- df %>%
  dplyr::filter(hour == format(Sys.time(), "%H")) %>%
  dplyr::left_join(todaysCall) %>%
  dplyr::arrange(desc(Prodprobability), noOfCalls)
Essentially, getting the probability of pickups is what models are all about, and you already seem to have that information.
Alternate solution
Get the top 5 calling times for each propertyID:
top5Times <- df %>%
  dplyr::filter(Prodprobability != 0) %>%
  dplyr::group_by(propertyID) %>%
  dplyr::arrange(desc(Prodprobability)) %>%
  dplyr::slice(1:5L) %>%
  dplyr::ungroup()
Get alternate calling time for cases with zero Prodprobability:
zeroProb <- df %>%
  dplyr::filter(Prodprobability == 0)

alternateTimes <- df %>%
  dplyr::filter(propertyID %in% zeroProb$propertyID) %>%
  dplyr::filter(Prodprobability != 0) %>%
  dplyr::arrange(propertyID, desc(Prodprobability))
Best calling hour for cases with zero probability at a given time:
# Identifies the zero-probability cases; can be by hour or at a particular instant
zeroProb <- df %>%
  dplyr::filter(Prodprobability == 0)

# Gets the highest calling probability, and the closest hour if the
# probability is the same for more than one timeslot
bestTimeForZero <- df %>%
  dplyr::filter(propertyID %in% zeroProb$propertyID) %>%
  dplyr::filter(Prodprobability != 0) %>%
  dplyr::group_by(propertyID) %>%
  dplyr::arrange(desc(Prodprobability), hour) %>%
  dplyr::slice(1L) %>%
  dplyr::ungroup()
Returning number of records as per original df:
zeroProb <- df %>%
  dplyr::filter(Prodprobability == 0) %>%
  dplyr::group_by(propertyID) %>%
  dplyr::summarise(total = n())

bestTimesList <- lapply(1:nrow(zeroProb), function(i) {
  limit <- zeroProb$total[i]
  bestTime <- df %>%
    dplyr::filter(propertyID == zeroProb$propertyID[i]) %>%
    dplyr::arrange(desc(Prodprobability)) %>%
    dplyr::slice(1:limit)
  return(bestTime)
})

bestTimeDf <- bind_rows(bestTimesList)
Note: You can combine the filter statements; I have written them separately to highlight what each step does.
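For example, the two filter calls in the bestTimeForZero step can be collapsed into a single call (shown only to illustrate the note above):
bestTimeForZero <- df %>%
  dplyr::filter(propertyID %in% zeroProb$propertyID,
                Prodprobability != 0) %>%
  dplyr::group_by(propertyID) %>%
  dplyr::arrange(desc(Prodprobability), hour) %>%
  dplyr::slice(1L) %>%
  dplyr::ungroup()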

Creating Iterated Variables in R

I've looked around and seen some questions similar to mine, but none directly on point. I have a series of presidential election results for various states from 1940 to 2012. They are labeled, in sequence, r1940, d1940, r1944, d1944, r1948, d1948, and so forth.
I want to create a series of two-party vote variables, calculated by dividing the number of Democratic votes by the sum of the Republican and Democratic votes. So in a df called votes:
d2pv1940 <- (votes$d1940/(votes$d1940+votes$r1940))
Obviously I can do this 18 more times by hand, e.g., d2pv1944 <- (votes$d1944/(votes$d1944+votes$r1944)), but that is time-consuming and invites errors. I've seen some solutions to similar problems using lapply or for loops, but I'm not really sure how I'd iterate the four variable names in the commands above.
Try something like this:
namest <- colnames(votes)
rep <- which(substr(namest, 1, 1) == "r")   # positions of the Republican columns
dem <- which(substr(namest, 1, 1) == "d")   # positions of the Democratic columns
res <- votes[, dem] / (votes[, dem] + votes[, rep])
colnames(res) <- paste("d2pv", substring(colnames(votes[, dem]), 2), sep = "")
res
Here's a tidy way to do it:
library(dplyr)
library(tidyr)
library(rex)

data <-
  c(1, 2, 2, 1) %>%
  setNames(c("r1940", "d1940", "r1944", "d1944")) %>%
  as.list %>%
  as.data.frame

regex_1 <-
  rex(capture(letter),
      capture(digits))

abbreviations <- data_frame(
  abbreviation = c("d", "r"),
  party = c("democrat", "republican"))

data %>%
  gather(variable, value) %>%
  extract(variable,
          c("abbreviation", "year"),
          regex_1) %>%
  left_join(abbreviations) %>%
  group_by(year) %>%
  mutate(total = sum(value),
         proportion = value / total) %>%
  select(-abbreviation, -value) %>%
  spread(party, proportion)
