R: Home sales in the last year before each sale

As a follow-up question to a previous one in the same project:
I found that real estate is often measured in inventory time, which is defined as (number of active listings) / (number of homes sold per month, averaged over the last 12 months). The best way I could find to count the number of homes sold in the 12 months before each home sale is a for-loop.
# count, for each sale, how many sales closed in the preceding 365 days
# (dplyr assumed loaded)
homesales$yearlysales <- 0
for (i in 1:nrow(homesales)) {
  sdt <- as.Date(homesales$saledate[i])
  x <- homesales %>%
    filter(sdt - saledate >= 0 & sdt - saledate < 365) %>%
    summarise(count = n())
  homesales$yearlysales[i] <- x$count[1]
}
homesales$inventorytime <- homesales$inventory / homesales$yearlysales * 12
homesales$inventorytime[is.na(homesales$saledate)] <- NA
homesales$inventorytime[homesales$yearlysales == 0] <- NA
Obviously (?), R has some prejudice against using a for-loop for this kind of selection. Is there a better way?
Appendix 1. data table structure
address, listingdate, saledate
101 Street, 2017/01/01, 2017/06/06
106 Street, 2017/03/01, 2017/08/11
102 Street, 2017/05/04, 2017/06/13
109 Street, 2017/07/04, 2017/11/24
...
Appendix 2. The output I'm looking for is something like this.

The following gives you the number of active listings on any given day:
library(tidyverse)
library(lubridate)

tmp <- tempfile()
download.file("https://raw.githubusercontent.com/robhanssen/glenlake-homesales/master/homesalesdata-source.csv", tmp)

# one row per event: a "listing" adds a home to the market, a "sale" removes one
data <- read_csv(tmp) %>%
  select(ends_with("date")) %>%
  mutate(across(everything(), mdy)) %>%
  pivot_longer(cols = everything(), names_to = "activity", values_to = "date", names_pattern = "(.*)date")

# running count of active listings, keeping the end-of-day value per date
active <- data %>%
  mutate(active = if_else(activity == "listing", 1, -1)) %>%
  arrange(date) %>%
  mutate(active = cumsum(active)) %>%
  group_by(date) %>%
  filter(row_number() == n()) %>%
  select(-activity)

# fill out to one row per calendar day
daily <- tibble(date = seq(min(data$date, na.rm = TRUE), max(data$date, na.rm = TRUE), by = "days")) %>%
  left_join(active, by = "date") %>%
  fill(active)
daily
Basically, we pivot longer and split each row of data into two rows indicating distinct activities: adding a listing or removing a listing. Then the cumulative sum of this gives you the number of active listings.
Note that this assumes you are not missing any data. Depending on the specification from which the csv was made, you could be missing activity at the start or end, but that is a caveat about the csv itself.
The number of active listings is a fact about an instant in time; sales are a fact about a time period. You probably want to aggregate sales by month, and then use the number of active listings from the last day of the month, or perhaps the average number of listings over that month.
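For completeness, here is a minimal sketch of that monthly aggregation, reusing data and the daily table built above. It computes a month-by-month inventory time (average active listings divided by that month's sales), which is a simpler variant of the asker's 12-month-trailing definition:
monthly_sales <- data %>%
  filter(activity == "sale", !is.na(date)) %>%
  mutate(month = floor_date(date, "month")) %>%
  count(month, name = "sales")

daily %>%
  mutate(month = floor_date(date, "month")) %>%
  group_by(month) %>%
  summarise(avg_active = mean(active, na.rm = TRUE), .groups = "drop") %>%
  left_join(monthly_sales, by = "month") %>%
  mutate(inventory_time = avg_active / sales)  # in months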

Related

How to use R to filter data that has multiple conditions

I have this set of data, where firms receive subsidies in different years, and I want to find the firms (ID) that never received any subsidy, i.e. sub = 0 in the years 2010-2021.
I tried to use filter(firms$sub == 0), but I don't know how to find the IDs that show all zeros during the years 2010-2021.
We may group by 'ID', filter only those groups where all the sub values are 0 within the year range 2010 to 2021, and extract the distinct IDs:
library(dplyr)
firms %>%
  group_by(ID) %>%
  filter(all(sub[year %in% 2010:2021] == 0, na.rm = TRUE)) %>%
  ungroup() %>%
  distinct(ID)
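A quick check of that pipeline against made-up data (the values below are hypothetical, purely to show the behavior):
firms <- data.frame(
  ID   = c(1, 1, 2, 2, 3, 3),
  year = c(2010, 2015, 2012, 2018, 2011, 2020),
  sub  = c(0, 0, 0, 500, 0, 0)
)
firms %>%
  group_by(ID) %>%
  filter(all(sub[year %in% 2010:2021] == 0, na.rm = TRUE)) %>%
  ungroup() %>%
  distinct(ID)
# returns IDs 1 and 3; firm 2 received a subsidy in 2018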

Extract highest value for two other variables - R code

Test assignment: Highest amount of free wine given by day/driver combination?
Note: I need to perform this work with the tidyverse library only; I can't load any other library.
I need help with my code:
pizza %>%
  select(day, driver, free_wine) %>%
  group_by(day, driver) %>%
  summarise(n = sum(free_wine, na.rm = TRUE), .groups = 'drop')
My output is not correct (it shows multiple line items for each day). I understand that I need a line of code to show the max value by day and driver, but I'm unable to figure out how to do it without breaking the group_by configuration.
The expected output should be one row for each day showing the max value.
Example
Day        Driver  n
Friday     Sam     20
Thursday   Tom     12
Wednesday  Rick    15
Try the following. After adding the free_wine amounts per driver and day, filter the maximal values.
pizza %>%
  group_by(day, driver) %>%
  summarise(free_wine = sum(free_wine, na.rm = TRUE), .groups = 'drop_last') %>%
  # drop_last leaves the result grouped by day, so the filter picks the max per day
  filter(free_wine == max(free_wine))
An alternative is to drop the groups and group again, this time by day only.
pizza %>%
  group_by(day, driver) %>%
  summarise(free_wine = sum(free_wine, na.rm = TRUE), .groups = 'drop') %>%
  group_by(day) %>%
  filter(free_wine == max(free_wine))
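If your tidyverse ships dplyr 1.0 or later, slice_max() is an equivalent way to write that last step; like the filter() above, it keeps ties by default:
pizza %>%
  group_by(day, driver) %>%
  summarise(free_wine = sum(free_wine, na.rm = TRUE), .groups = 'drop') %>%
  group_by(day) %>%
  slice_max(free_wine, n = 1) %>%  # with_ties = TRUE by default
  ungroup()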

Finding the first row after which x rows meet some criterion in R

A data wrangling question:
I have a dataframe of hourly animal tracking points with columns for id, time, and whether the animal is on land or in water (0 = water; 1 = land). It looks something like this:
set.seed(13)
n <- 100
dat <- data.frame(id = rep(1:5, each = 20),  # 5 ids x 20 hourly fixes = 100 rows
                  datetime = seq(as.POSIXct("2020-12-26 00:00:00"), as.POSIXct("2020-12-30 3:00:00"), by = "hour"),
                  land = sample(0:1, n, replace = TRUE))
What I need to do is flag the first row after which the animal uses land at least once for 3 straight days. I tried doing something like this:
dat$ymd <- ymd(dat$datetime[1]) # make column for year-month-day
# add land points within each id group
land.pts <- dat %>%
group_by(id, ymd) %>%
arrange(id, datetime) %>%
drop_na(land) %>%
mutate(all.land = cumsum(land))
#flag days that have any land points
flag <- land.pts %>%
group_by(id, ymd) %>%
arrange(id, datetime) %>%
slice(n()) %>%
mutate(flag = if_else(all.land == 0,0,1))
# Combine flagged dataframe with full dataframe
comb <- left_join(land.pts, flag)
comb[is.na(comb)] <- 1
and then I tried this:
x <- comb %>%
  group_by(id) %>%
  arrange(id, datetime) %>%
  mutate(time.land = ifelse(land == 0 | is.na(lag(land)) | lag(land) == 0 | flag == 0,
                            0,
                            difftime(datetime, lag(datetime), units = "days")))
But I still can't quite wrap my head around what to do to make it so that I can figure out when the animal has been on land at least once for three days straight, and then flag that first point on land. Thanks so much for any help you can provide!
Create a date column from the timestamp. Summarise the data and keep only 1 row for each id and date which shows whether the animal was on land even once in the entire day.
Use zoo's rollapply function to mark a day as TRUE if the animal was on land on that day and on each of the following two days (a three-day window).
library(dplyr)
library(zoo)

dat <- dat %>% mutate(date = as.Date(datetime))

dat %>%
  group_by(id, date) %>%
  summarise(on_land = any(land == 1)) %>%
  mutate(consec_three = rollapply(on_land, 3, all, align = 'left', fill = NA)) %>%
  ungroup() %>%
  # if you want all the rows of the data
  left_join(dat, by = c('id', 'date'))
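If only the first row where such a streak starts is needed, here is a sketch building on the same daily summary (note that with the small toy data above the result may be empty; real multi-day tracks are assumed):
dat %>%
  group_by(id, date) %>%
  summarise(on_land = any(land == 1), .groups = "drop_last") %>%
  mutate(consec_three = rollapply(on_land, 3, all, align = 'left', fill = NA)) %>%
  filter(consec_three) %>%
  slice(1) %>%  # first date on which a three-day land streak begins, per id
  ungroup()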

R dplyr subsetting

I need to calculate min, max and mean by customer after subsetting the population for primary contacts. To do this, I need to drop observations within a customer group if contact == relation and amount < 25. But the tricky part is: if contact == relation and amount == amount, I need to keep both observations regardless of the amount (this accounts for ties, where we cannot define the primary contact).
If contact == relation, one can think of this as a household.
Each customer can consist of multiple households, so I've included contacts with NULL relationship values.
Sample Data
customer <- c(1,1,1,1,2,2,2,3,3,3,3)
contact <- c(1234,2345,3456,4567,5678,6789,7890,8901,9012,1236,2346)
relationship <- c(2345,1234,"","",6789,5678,"",9012,8901,2346,1236)
amount <- c(26,22,40,12,15,15,70,35,15,25,25)
score <- c(500,300,700,600,400,600,700,650,300,600,700)
creditinfoaggtestdata1 <- data.frame(customer,contact,relationship,amount,score)
Expected Outcome
As a point of reference, if I do not drop the appropriate contacts prior to calculating min, max and mean by customer, I get an output table as follows:
I assume the requirement "contact = relation and amount = amount" means across different rows within the same customer group. Here's a dplyr solution:
library(dplyr)

# create a contact-relationship id where direction doesn't matter
df <- creditinfoaggtestdata1 %>%
  rowwise() %>%
  mutate(id = paste0(min(contact, relationship), max(contact, relationship)))

# find ids where duplicate amounts exist within a customer group
dups <- df %>%
  group_by(customer, id, amount) %>%
  summarise(count = n()) %>%
  filter(count > 1) %>%
  ungroup() %>%
  select(customer, id)

# use an inner join to keep only the tied contact-relationship combinations
a <- df %>%
  filter(amount < 25) %>%
  inner_join(dups, by = c("customer", "id"))

# combine with the >= 25 data
b <- df %>%
  filter(amount >= 25)
c <- rbind(a, b)

c %>%
  group_by(customer) %>%
  summarise(min_score = min(score), max_score = max(score), avg_score = mean(score))
Output:
  customer min_score max_score avg_score
     <dbl>     <dbl>     <dbl>     <dbl>
1        1       500       700       600
2        2       400       700       567.
3        3       600       700       650
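As a quick sanity check (not part of the original answer), the observations that were dropped before aggregating can be listed with an anti-join:
# rows of df that did not survive the filtering into c
df %>%
  ungroup() %>%  # drop the rowwise grouping before joining
  anti_join(c, by = c("customer", "contact"))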

How to build a recommendation model for calling prospects

My goal is to better target prospects at a higher call success rate, based on time of day and prior history.
I have created a "Prodprobability" column showing the probability of a PropertyID answering the phone at that hour, over the history of calls. Instead of merely omitting Property ID 233303.13 from any calls, I want to retarget them into hour 13 or hour 16 (the sample data doesn't show it, but the probabilities of pickup at those hours are 100% and 25% respectively).
So, moving forward, based on hour of day, and history of that prospect picking up the phone or not during that hour, I'd like to re-target every prospect during the hours they're most likely to pick up.
sample data
EDIT: I guess I need a formula to do this: If "S425=0", I want to search for where "A425" has the highest probability in the S column, and return the hour and probability for that "PropertyID". Hopefully that makes sense.
The question here would be: are you dead set on creating a 'model', or would an automation work for you?
I would suggest ordering the dataframe by the probability of picking up the call at each hour (so you can serve the more probable leads first) and then further sorting them by the number of calls on that day.
Something along the lines of:
require(dplyr)

todaysCall <- df %>%
  dplyr::group_by(propertyID) %>%
  dplyr::summarise(noOfCalls = n())

hourlyCalls <- df %>%
  # format() returns character, so convert before comparing to the numeric hour column
  dplyr::filter(hour == as.numeric(format(Sys.time(), "%H"))) %>%
  dplyr::left_join(todaysCall, by = "propertyID") %>%
  dplyr::arrange(desc(Prodprobability), noOfCalls)
Essentially, getting the probability of pickups is what models are all about, and you already seem to have that information.
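For reference, the snippets in this answer assume a df shaped roughly like this (hypothetical values, since the sample-data screenshot isn't reproduced here):
df <- data.frame(
  propertyID      = rep(c(233303.13, 233303.14), each = 3),
  hour            = rep(c(9, 13, 16), times = 2),
  Prodprobability = c(0, 1, 0.25, 0.5, 0.2, 0)
)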
Alternate solution
Get top 5 calling times for each propertyID
top5Times <- df %>%
  dplyr::filter(Prodprobability != 0) %>%
  dplyr::group_by(propertyID) %>%
  dplyr::arrange(desc(Prodprobability)) %>%
  dplyr::slice(1:5) %>%
  dplyr::ungroup()
Get alternate calling time for cases with zero Prodprobability:
zeroProb <- df %>%
  dplyr::filter(Prodprobability == 0)

alternateTimes <- df %>%
  dplyr::filter(propertyID %in% zeroProb$propertyID) %>%
  dplyr::filter(Prodprobability != 0) %>%
  dplyr::arrange(propertyID, desc(Prodprobability))
Best calling hour for cases with zero probability at given time:
# identifies the zero-prob cases; can be by hour or at a particular instant
zeroProb <- df %>%
  dplyr::filter(Prodprobability == 0)

# gets the highest calling probability, and the corresponding closest hour
# if the probability is the same for more than one timeslot
bestTimeForZero <- df %>%
  dplyr::filter(propertyID %in% zeroProb$propertyID) %>%
  dplyr::filter(Prodprobability != 0) %>%
  dplyr::group_by(propertyID) %>%
  dplyr::arrange(desc(Prodprobability), hour) %>%
  dplyr::slice(1L) %>%
  dplyr::ungroup()
Returning number of records as per original df:
zeroProb <- df %>%
  dplyr::filter(Prodprobability == 0) %>%
  dplyr::group_by(propertyID) %>%
  dplyr::summarise(total = n())

bestTimesList <- lapply(1:nrow(zeroProb), function(i) {
  limit <- zeroProb$total[i]
  bestTime <- df %>%
    dplyr::filter(propertyID == zeroProb$propertyID[i]) %>%
    dplyr::arrange(desc(Prodprobability)) %>%
    dplyr::slice(1:limit)
  return(bestTime)
})
bestTimeDf <- bind_rows(bestTimesList)
Note: you can combine the filter statements; I have written them separately to highlight what each step does.
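For example, the "best calling hour" step can be written with the two filters combined (the same logic as above, just condensed):
bestTimeForZero <- df %>%
  dplyr::filter(propertyID %in% propertyID[Prodprobability == 0],
                Prodprobability != 0) %>%
  dplyr::group_by(propertyID) %>%
  dplyr::arrange(desc(Prodprobability), hour) %>%
  dplyr::slice(1L) %>%
  dplyr::ungroup()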
