Extract highest value for two other variables - R

Test assignment: Highest amount of free wine given by day/driver combination?
Note: I need to do this with the tidyverse only; I can't load any other library.
I need help with my code:
pizza %>%
  select(day, driver, free_wine) %>%
  group_by(day, driver) %>%
  summarise(n = sum(free_wine, na.rm = TRUE), .groups = 'drop')
My output is not correct (it shows multiple rows for each day). I understand that I need a line of code to keep only the max value per day/driver combination, but I can't figure out how to do it without breaking the group_by configuration.
The expected output should be one row for each day, showing the max value.
Example
Day Driver n
Friday Sam 20
Thursday Tom 12
Wednesday Rick 15

Try the following. After summing the free_wine amounts per day and driver, keep only the maximal value within each day.
pizza %>%
  group_by(day, driver) %>%
  summarise(free_wine = sum(free_wine, na.rm = TRUE), .groups = 'drop_last') %>% # 'drop_last' leaves the result grouped by day only
  filter(free_wine == max(free_wine))
An alternative is to drop the groups and group again, but this time by day only.
pizza %>%
  group_by(day, driver) %>%
  summarise(free_wine = sum(free_wine, na.rm = TRUE), .groups = 'drop') %>%
  group_by(day) %>%
  filter(free_wine == max(free_wine))
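If your dplyr version is recent enough, slice_max() gives a tidyverse-only variant of the same idea (a sketch):
pizza %>%
  group_by(day, driver) %>%
  summarise(free_wine = sum(free_wine, na.rm = TRUE), .groups = 'drop') %>%
  group_by(day) %>%
  slice_max(free_wine, n = 1) %>% # keeps the row(s) with the highest total per day
  ungroup()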

Related

Recurring investment using R and PerformanceAnalytics

I am using R and PerformanceAnalytics to calculate the portfolio returns of a strategy.
Specifically, I want to start with $3000, invested equally across available assets, while adding a recurring $1000 split equally across available assets in January and June. The previous investments should not be reallocated, just the $1000 equally split each rebalance.
The code below can be used to calculate the growth in returns when investing $3000 initially and rebalancing every six months, but it does not allow recurring investments split across assets.
Specifically, I want to split the additional investment amount ($1000 every six months) across the stocks at rebalancing, without reallocating what has already been allocated to them.
The following does not achieve this, but gives a starting point for someone able to assist:
library(tidyverse)
library(PerformanceAnalytics)
library(tbl2xts)

data(managers)
df_series <- managers[, 1:3] %>% xts_tbl()

# rebalance weights: equal split across funds with a non-missing, non-zero return in each Jan/Jun month
w_xts <- df_series %>%
  filter(format(date, "%b") %in% c("Jan", "Jun")) %>%
  gather(fund, value, -date) %>%
  mutate(value = coalesce(value, 0)) %>%
  mutate(value = ifelse(abs(value) == 0, 0, 1)) %>%
  arrange(date) %>%
  group_by(date) %>%
  mutate(value = value / sum(value)) %>%
  ungroup() %>%
  tbl_xts(cols_to_xts = value, spread_by = fund)

r_xts <- df_series %>% tbl_xts()
r_xts[is.na(r_xts)] <- 0

portfolio_return <- PerformanceAnalytics::Return.portfolio(R = r_xts, weights = w_xts, value = 3000, verbose = TRUE)
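Not a solution to the recurring-investment part, but as a starting point for whoever picks this up: with verbose = TRUE the result is a list of components rather than a single return series, so inspecting its structure shows what there is to work with at each rebalance (a minimal sketch):
# list the components returned by Return.portfolio(verbose = TRUE); these include
# the period portfolio returns plus per-asset weight and value series
# (exact element names may vary by package version)
str(portfolio_return, max.level = 1)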

Compute a custom mean for each row over multiple columns, based on a set of conditions

I have a complex problem and I will be grateful if someone can help me out. I have a dataframe made up of appended survey data for different countries in different years. In the same dataframe, I also have air quality measures for the neighbourhoods where respondents were selected. The air quality data cover 1998 to 2016.
My problem is that I want to compute the row mean (or cumulative mean exposure) for each person based on the respondent's age and the air quality data years. My data frame looks like this:
dat <- data.frame(ID=c(1:2000), dob = sample(1990:2020, size=2000, replace=TRUE),
survey_year=rep(c(1998, 2006, 2008, 2014, 2019), times=80, each=5),
CNT = rep(c('AO', 'GH', 'NG', 'SL', 'UG'), times=80, each=5),
Ozone_1998=runif(2000), Ozone_1999=runif(2000), Ozone_2000=runif(2000),
Ozone_2001=runif(2000), Ozone_2002=runif(2000), Ozone_2003=runif(2000),
Ozone_2004=runif(2000), Ozone_2005=runif(2000), Ozone_2006=runif(2000),
Ozone_2007=runif(2000), Ozone_2008=runif(2000), Ozone_2009=runif(2000),
Ozone_2010=runif(2000), Ozone_2011=runif(2000), Ozone_2012=runif(2000),
Ozone_2013=runif(2000), Ozone_2014=runif(2000), Ozone_2015=runif(2000),
Ozone_2016=runif(2000))
In the example data frame above, all respondents in country AO will have their cumulative mean air quality exposure restricted to Ozone_1998, while respondents in country SL will have their mean calculated based on Ozone_1998 to Ozone_2014.
Next, for a person in country SL aged 15 years, I want their cumulative exposure to run from Ozone_2000 to Ozone_2014 (the 15-year period of their life includes their birth year). A person aged 16 will have their mean from Ozone_1999 to Ozone_2014, and so on.
Is there a way to do this complex task in R?
NB: Although my question is similar to another one I posted (see link below), this task is much more complex. I tried adapting the solution to my previous question, but my attempts did not work. For instance, I tried
dat$mean_exposure = dat %>%
  pivot_longer(starts_with("Ozone"), names_pattern = "(.*)_(.*)", names_to = c("type", "year")) %>%
  mutate(year = as.integer(year)) %>%
  group_by(ID) %>%
  summarize(mean_under5_ozone = mean(value[between(year, survey_year, survey_year + 0)]), .groups = "drop")
but got an error
Error: Problem with `summarise()` input `mean_under5_ozone`.
x `left` must be length 1
i Input `mean_under5_ozone` is `mean(value[between(year, survey_year, survey_year + 0)])`.
i The error occurred in group 1: ID = 1.
Link to the previous question
How to compute a custom mean for each row over multiple columns, based on a row-specific criterion?
Thank you
The tidying step from your last question works well:
tidy_data = dat %>%
  pivot_longer(
    starts_with("Ozone"),
    names_pattern = "(.*)_(.*)",
    names_to = c(NA, "year"),
    values_to = "ozone"
  ) %>%
  mutate(year = as.integer(year))
Now you can filter to the years you want and get the mean exposure by country / age:
mean_lifetime_exposure = tidy_data %>%
  group_by(CNT, dob) %>%
  filter(year >= dob) %>%
  summarise(mean(ozone))
PS I'm sorry I don't quite understand your first question about country AO.
Edit:
Does this do what you wanted? The logic is a bit convoluted but the code is straightforward.
tidy_data_filtered = tidy_data %>%
  filter(
    !(CNT == "AO" & year != 1998),
    !(CNT == "SL" & !year %in% 1998:2014)
  )
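Building on the same tidy data, here is a sketch of one way to apply both cutoffs described in the question (each person's window runs from their birth year up to their survey year, assuming the survey year marks the end of exposure):
mean_exposure = tidy_data %>%
  filter(year >= dob, year <= survey_year) %>% # per-person window: birth year through survey year
  group_by(ID) %>%
  summarise(mean_exposure = mean(ozone), .groups = "drop")

# join the result back onto the original wide data if needed
dat_with_exposure = left_join(dat, mean_exposure, by = "ID")
Respondents whose simulated dob falls after their survey year have no rows left after the filter, so their mean_exposure will simply be NA after the join.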

R: Home sales in the last year before each sale

As a follow-up question to a previous one in the same project:
I found that real estate is often measured in inventory time, which is defined as (number of active listings) / (number of homes sold per month, averaged over the last 12 months). The best way I could find to count the number of homes sold in the 12 months before each home sale is a for-loop.
homesales$yearlysales <- 0
for (i in 1:nrow(homesales)) {
  sdt <- as.Date(homesales$saledate[i])
  x <- homesales %>%
    filter(sdt - saledate >= 0 & sdt - saledate < 365) %>%
    summarise(count = n())
  homesales$yearlysales[i] <- x$count[1]
}
homesales$inventorytime <- homesales$inventory / homesales$yearlysales * 12
homesales$inventorytime[is.na(homesales$saledate)] <- NA
homesales$inventorytime[homesales$yearlysales == 0] <- NA
Obviously (?), the R language frowns on using a for-loop for this type of selection. Is there a better way?
Appendix 1. data table structure
address, listingdate, saledate
101 Street, 2017/01/01, 2017/06/06
106 Street, 2017/03/01, 2017/08/11
102 Street, 2017/05/04, 2017/06/13
109 Street, 2017/07/04, 2017/11/24
...
Appendix 2. The output I'm looking for is something like this.
The following gives you the number of active listings on any given day:
library(tidyverse)
library(lubridate)

tmp <- tempfile()
download.file("https://raw.githubusercontent.com/robhanssen/glenlake-homesales/master/homesalesdata-source.csv", tmp)

data <- read_csv(tmp) %>%
  select(ends_with("date")) %>%
  mutate(across(everything(), mdy)) %>%
  pivot_longer(cols = everything(), names_to = "activity", values_to = "date", names_pattern = "(.*)date")

active <- data %>%
  mutate(active = if_else(activity == "listing", 1, -1)) %>%
  arrange(date) %>%
  mutate(active = cumsum(active)) %>%
  group_by(date) %>%
  filter(row_number() == n()) %>%
  select(-activity)

tibble(date = seq(min(data$date, na.rm = TRUE), max(data$date, na.rm = TRUE), by = "days")) %>%
  left_join(active) %>%
  fill(active)
Basically, we pivot longer and split each row of data into two rows indicating distinct activities: adding a listing or removing a listing. The cumulative sum of this then gives you the number of active listings.
Note that this assumes you are not missing any data. Depending on the specification from which the csv was made, you could be missing activity at the start or end, but that is a caveat about the csv itself.
Active listings are a fact about an instant in time, while sales are a fact about a time period. You probably want to aggregate sales by month and then use the number of active listings from the last day of the month, or perhaps the average number of listings over that month.
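For example, here is a rough sketch of that monthly aggregation, reusing the data and active objects from above (and assuming the pivoted activity values are "listing" and "sale"):
monthly_sales <- data %>%
  filter(activity == "sale", !is.na(date)) %>%
  count(month = floor_date(date, "month"), name = "sales")

eom_active <- tibble(date = seq(min(data$date, na.rm = TRUE), max(data$date, na.rm = TRUE), by = "day")) %>%
  left_join(active, by = "date") %>%
  fill(active) %>%
  group_by(month = floor_date(date, "month")) %>%
  summarise(active_eom = last(active), .groups = "drop") # active listings on the last day of each month

inventory <- eom_active %>%
  left_join(monthly_sales, by = "month") %>%
  mutate(sales = replace_na(sales, 0L),
         # trailing 12-month average of monthly sales (NA for the first 11 months)
         avg_sales_12m = as.numeric(stats::filter(sales, rep(1 / 12, 12), sides = 1)),
         inventory_months = active_eom / avg_sales_12m)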

How do I take one column from a data set and columns from another data set to filter out something specific?

I am currently working with the nycflights13 package. I want to test one variable/column from one data set against a column from another data set.
nycflights13 has a data set called "planes" and another data set called "flights". Both data sets have a column called "tailnum", which is essentially the identifier of the plane. The age of the plane is located in "planes" while the place of origin for the flight is located in "flights". How would I write code in R that could tell me which is the oldest plane that flew out of JFK? Basically, I need to take the column in "flights" that has the origins of the flights (where they are flying out of), filter out everything that isn't JFK, and then sort descending by age, which comes from a column in "planes".
Welcome to Stack Overflow!
How about this:
library(nycflights13)
library(conflicted) # useful for dealing with functions that exist in several packages
suppressMessages(conflict_prefer("filter", "dplyr")) # use filter from the dplyr package
suppressPackageStartupMessages(library(tidyverse)) # load tidyverse without messages

data("flights") # from nycflights13

theflights <- flights %>%
  filter(origin == "JFK") %>% # get the flights from JFK
  select(tailnum) # only keep the airplane number

data("planes") # from nycflights13

# if you want the age
inner_join(theflights, planes) %>% # select the records from planes with matching tailnums defined above
  filter(year == min(year, na.rm = TRUE)) %>% # get the oldest airplane
  select(year) %>% # keep the year
  distinct() %>% # remove duplicates
  mutate(age = 2013 - year) %>% # calculate the age in 2013
  pull(age) # pull age into its own vector

# or
inner_join(theflights, planes) %>%
  summarize(theYear = min(year, na.rm = TRUE)) %>%
  mutate(age = 2013 - theYear) %>%
  pull(age)

# if you want the oldest airplane(s) tail numbers
inner_join(theflights, planes) %>% # select the records from planes with matching tailnums defined above
  filter(year == min(year, na.rm = TRUE)) %>% # you can use T or TRUE, but TRUE is preferred
  select(tailnum) %>%
  distinct()

# if you want the oldest airplane(s) model
theModel <- inner_join(theflights, planes) %>% # select the records from planes with matching tailnums defined above
  filter(year == min(year, na.rm = TRUE)) %>% # you can use T or TRUE, but TRUE is preferred
  select(model) %>%
  distinct() %>%
  pull()

# get all the airplanes of that model
DC7BFs <- planes %>%
  filter(model == theModel)
The filter() statement starts by finding the smallest (i.e., minimum) year after dropping the missing (i.e., NA) year values. It then keeps every record that matches that year.
In principle you can use either T or TRUE for logical values; however, TRUE is recommended by the R style guide.
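As a sketch of a more compact variant (not part of the answer above), the join and the age calculation can also be done in a single pipeline that returns the planes ordered from oldest to newest:
library(nycflights13)
library(dplyr)

flights %>%
  filter(origin == "JFK") %>%            # only flights departing JFK
  distinct(tailnum) %>%                  # one row per plane
  inner_join(planes, by = "tailnum") %>% # bring in the manufacturing year from planes
  filter(!is.na(year)) %>%
  arrange(year) %>%                      # oldest planes first
  mutate(age = 2013 - year) %>%          # age as of 2013, as above
  select(tailnum, year, age, model)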

How to build recommendation model for calling prospects

My goal is to better target prospects at a higher call success rate, based on time of day and prior history.
I have created a "Prodprobability" column showing the probability of a PropertyID answering the phone at a given hour, based on the history of calls. Instead of merely omitting Property ID 233303.13 from any calls, I want to retarget them at hour 13 or hour 16 (the sample data doesn't show it, but the probabilities of pickup at those hours are 100% and 25% respectively).
So, moving forward, based on hour of day, and history of that prospect picking up the phone or not during that hour, I'd like to re-target every prospect during the hours they're most likely to pick up.
sample data
EDIT: I guess I need a formula to do this: If "S425=0", I want to search for where "A425" has the highest probability in the S column, and return the hour and probability for that "PropertyID". Hopefully that makes sense.
EDIT: the sample data returns this:
The question here would be: are you dead set on creating a 'model', or would an automation work for you?
I would suggest ordering the dataframe by the probability of picking up the call at each hour (so you can call the more probable leads first) and then further sorting them by the number of calls made that day.
Something along the lines of:
require(dplyr)

todaysCall = df %>%
  dplyr::group_by(propertyID) %>%
  dplyr::summarise(noOfCalls = n())

hourlyCalls = df %>%
  dplyr::filter(hour == format(Sys.time(), "%H")) %>%
  dplyr::left_join(todaysCall) %>%
  dplyr::arrange(desc(Prodprobability), noOfCalls)
Essentially, getting the probability of pickups is what models are all about, and you already seem to have that information.
Alternate solution
Get top 5 calling times for each propertyID
top5Times = df %>%
  dplyr::filter(Prodprobability != 0) %>%
  dplyr::group_by(propertyID) %>%
  dplyr::arrange(desc(Prodprobability)) %>%
  dplyr::slice(1:5L) %>%
  dplyr::ungroup()
Get alternate calling times for cases with zero Prodprobability:
zeroProb = df %>%
  dplyr::filter(Prodprobability == 0)

alternateTimes = df %>%
  dplyr::filter(propertyID %in% zeroProb$propertyID) %>%
  dplyr::filter(Prodprobability != 0) %>%
  dplyr::arrange(propertyID, desc(Prodprobability))
Best calling hour for cases with zero probability at a given time:
# Identifies the zero-probability cases; can be by hour or at a particular instant
zeroProb = df %>%
  dplyr::filter(Prodprobability == 0)

# Gets the highest calling probability, and the closest hour if the probability is the same for more than one timeslot
bestTimeForZero = df %>%
  dplyr::filter(propertyID %in% zeroProb$propertyID) %>%
  dplyr::filter(Prodprobability != 0) %>%
  dplyr::group_by(propertyID) %>%
  dplyr::arrange(desc(Prodprobability), hour) %>%
  dplyr::slice(1L) %>%
  dplyr::ungroup()
Returning the same number of records as in the original df:
zeroProb = df %>%
  dplyr::filter(Prodprobability == 0) %>%
  dplyr::group_by(propertyID) %>%
  dplyr::summarise(total = n())

bestTimesList = lapply(1:nrow(zeroProb), function(i) {
  limit = zeroProb$total[i]
  bestTime = df %>%
    dplyr::filter(propertyID == zeroProb$propertyID[i]) %>%
    dplyr::arrange(desc(Prodprobability)) %>%
    dplyr::slice(1:limit)
  return(bestTime)
})

bestTimeDf = bind_rows(bestTimesList)
Note: You can combine the filter statements; I have written them separately to highlight what each step does.
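For instance, the two filter() calls in the bestTimeForZero step above could be combined into one (a sketch of the same logic):
bestTimeForZero = df %>%
  dplyr::filter(propertyID %in% zeroProb$propertyID, Prodprobability != 0) %>%
  dplyr::group_by(propertyID) %>%
  dplyr::arrange(desc(Prodprobability), hour) %>%
  dplyr::slice(1L) %>%
  dplyr::ungroup()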
