Need help ungrouping data frame - r

I have a dataset that I grouped in order to filter, and now I am trying to expand back out the rows that I originally had to consolidate in order to filter. The data is a set of sales points by sale date. There is a row for each sales point, but many different sales points for the same type of item. So I grouped by item in order to get important information like total sales, average sales price, etc. Now, after I have filtered and found the items with the right sales frequency etc., I need to ungroup the data in order to see all of the sales points. I have tried using "ungroup()", but either it is not working or I am not using it right.
top25000 <- read_csv("index_sales_export.csv")
blue_chip_index <- top25000 %>%
  select(graded_title, date, price, vcp_card_grade_id) %>%
  filter(date >= as.Date("2018-07-08")) %>%
  group_by(graded_title) %>%
  summarise(Market_Cap = sum(price),
            Average_Price = mean(price),
            VCP_Card_Grade_ID = mean(vcp_card_grade_id),
            count = n()) %>%
  filter(Market_Cap >= 10000 & count >= 25) %>%
  rename(Total_Number_of_Sales = count)
blue_chip_index

If I understand what you want, it sounds like you should just start with the original data frame again and then filter on the items that you want.
Ungrouping the new data frame won't allow you to unsummarize it.
Does this achieve what you want:
library(tidyverse)

top25000 <- read_csv("index_sales_export.csv")

blue_chip_index <- top25000 %>%
  select(graded_title, date, price, vcp_card_grade_id) %>%
  filter(date >= as.Date("2018-07-08")) %>%
  group_by(graded_title) %>%
  summarise(Market_Cap = sum(price),
            Average_Price = mean(price),
            VCP_Card_Grade_ID = mean(vcp_card_grade_id),
            count = n()) %>%
  filter(Market_Cap >= 10000 & count >= 25) %>%
  rename(Total_Number_of_Sales = count)

top25000 %>%
  filter(graded_title %in% blue_chip_index$graded_title)
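If the goal is to keep every individual sales point rather than re-joining against the original data, a sketch of the same idea using mutate() instead of summarise() (so no rows are collapsed and ungroup() behaves as you'd expect) might look like this; blue_chip_sales is just an illustrative name:

blue_chip_sales <- top25000 %>%
  select(graded_title, date, price, vcp_card_grade_id) %>%
  filter(date >= as.Date("2018-07-08")) %>%
  group_by(graded_title) %>%
  # add the group-level summaries as columns, keeping one row per sales point
  mutate(Market_Cap = sum(price),
         Average_Price = mean(price),
         VCP_Card_Grade_ID = mean(vcp_card_grade_id),
         Total_Number_of_Sales = n()) %>%
  filter(Market_Cap >= 10000 & Total_Number_of_Sales >= 25) %>%
  ungroup()

The summary columns are simply repeated on every row of each qualifying item.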

Related

How do I filter based on a count within summarise in order to use as part of other summarise functions?

I am looking to figure out how to filter after grouping my data within summarise. I have two created columns below. Ideally, I'd like to filter the seasonTotal column within summarise to values greater than 3, and then calculate homeRunsPerSeason based on that filtered count.
Reprex below:
library(Lahman)
library(tidyverse)

data <- Lahman::Batting

data <- data %>%
  filter(yearID > 2015)

grouped_data <- data %>%
  group_by(playerID) %>%
  summarise(seasonTotal = n(),
            homeRunsPerSeason = sum(HR / seasonTotal))
Separate each of the steps you want to accomplish. Calculate the season total, filter, then summarize.
grouped_data <- data %>%
  group_by(playerID) %>%
  mutate(seasonTotal = n()) %>%
  filter(seasonTotal > 3) %>%
  summarise(homeRunsPerSeason = sum(HR / seasonTotal))
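As a side note, dplyr's add_count() can attach the per-player row count in a single step; a minimal sketch of the same logic, assuming the same data object as above:

grouped_data <- data %>%
  add_count(playerID, name = "seasonTotal") %>%   # adds the group size without collapsing rows
  filter(seasonTotal > 3) %>%
  group_by(playerID) %>%
  summarise(homeRunsPerSeason = sum(HR / seasonTotal))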

R: Home sales in the last year before each sale

As a follow-up question to a previous one in the same project:
I found that real estate is often measured in inventory time, which is defined as (number of active listings) / (number of homes sold per month, averaged over the last 12 months). The best way I could find to count the number of homes sold in the last 12 months before each home sale is with a for-loop.
homesales$yearlysales <- 0
for (i in 1:nrow(homesales)) {
  sdt <- as.Date(homesales$saledate[i])
  x <- homesales %>%
    filter(sdt - saledate >= 0 & sdt - saledate < 365) %>%
    summarise(count = n())
  homesales$yearlysales[i] <- x$count[1]
}
homesales$inventorytime <- homesales$inventory / homesales$yearlysales * 12
homesales$inventorytime[is.na(homesales$saledate)] <- NA
homesales$inventorytime[homesales$yearlysales == 0] <- NA
Obviously (?), the R language has some prejudice against using a for-loop for this type of selection. Is there a better way?
Appendix 1. data table structure
address, listingdate, saledate
101 Street, 2017/01/01, 2017/06/06
106 Street, 2017/03/01, 2017/08/11
102 Street, 2017/05/04, 2017/06/13
109 Street, 2017/07/04, 2017/11/24
...
Appendix 2. The output I'm looking for is something like this.
The following gives you the number of active listings on any given day:
library(tidyverse)
library(lubridate)

tmp <- tempfile()
download.file("https://raw.githubusercontent.com/robhanssen/glenlake-homesales/master/homesalesdata-source.csv", tmp)

data <- read_csv(tmp) %>%
  select(ends_with("date")) %>%
  mutate(across(everything(), mdy)) %>%
  pivot_longer(cols = everything(), names_to = "activity", values_to = "date", names_pattern = "(.*)date")

active <- data %>%
  mutate(active = if_else(activity == "listing", 1, -1)) %>%
  arrange(date) %>%
  mutate(active = cumsum(active)) %>%
  group_by(date) %>%
  filter(row_number() == n()) %>%
  select(-activity)

tibble(date = seq(min(data$date, na.rm = TRUE), max(data$date, na.rm = TRUE), by = "days")) %>%
  left_join(active) %>%
  fill(active)
Basically, we pivot longer and split each row of data into two rows indicating distinct activities: adding a listing or removing a listing. Then the cumulative sum of this gives you the number of active listings.
Note, this assumes that you are not missing any data. Depending on the specification from which the csv was made, you could be missing activity at the start or end. But this is a warning about the csv itself.
Active listings is a fact about an instant in time. Sales is a fact about a time period. You probably want to aggregate sales by month, and then use the number of active listings from the last day of the month, or perhaps the average number of listings over that month.
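For example, a rough sketch of that monthly aggregation, assuming the daily tibble from the last pipe above has been saved as daily_active (a hypothetical name) and reusing the long-format data from the same answer:

# count sales per calendar month
monthly_sales <- data %>%
  filter(activity == "sale", !is.na(date)) %>%
  count(month = floor_date(date, "month"), name = "sales")

# pair each month's sales with the active-listing count on its last recorded day
daily_active %>%
  mutate(month = floor_date(date, "month")) %>%
  group_by(month) %>%
  filter(date == max(date)) %>%
  select(month, active) %>%
  ungroup() %>%
  left_join(monthly_sales, by = "month") %>%
  # single-month simplification; the question's definition averages sales over the trailing 12 months
  mutate(inventory_months = active / sales)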

How to build recommendation model for calling prospects

My goal is to better target prospects at a higher call success rate, based on time of day and prior history.
I have created a "Prodprobability" column showing the probability of a PropertyID answering the phone at that hour over the history of calls. Instead of merely omitting Property ID 233303.13 from any calls, I want to retarget them into hour 13 or hour 16 (the sample data doesn't show it, but the probabilities of pickup at those hours are 100% and 25%, respectively).
So, moving forward, based on hour of day, and history of that prospect picking up the phone or not during that hour, I'd like to re-target every prospect during the hours they're most likely to pick up.
sample data
EDIT: I guess I need a formula to do this: If "S425=0", I want to search for where "A425" has the highest probability in the S column, and return the hour and probability for that "PropertyID". Hopefully that makes sense.
EDIT: sample data returns this
The question here would be: are you dead set on creating a 'model', or would an automation work for you?
I would suggest ordering the data frame by the probability of picking up the call every hour (so you can serve the more probable leads first) and then further sorting them by the number of calls on that day.
Something along the lines of:
require(dplyr)

todaysCall <- df %>%
  dplyr::group_by(propertyID) %>%
  dplyr::summarise(noOfCalls = n())

hourlyCalls <- df %>%
  dplyr::filter(hour == format(Sys.time(), "%H")) %>%
  dplyr::left_join(todaysCall) %>%
  dplyr::arrange(desc(Prodprobability), noOfCalls)
Essentially, getting the probability of pickups is what models are all about, and you already seem to have that information.
Alternate solution
Get top 5 calling times for each propertyID
top5Times <- df %>%
  dplyr::filter(Prodprobability != 0) %>%
  dplyr::group_by(propertyID) %>%
  dplyr::arrange(desc(Prodprobability)) %>%
  dplyr::slice(1:5L) %>%
  dplyr::ungroup()
Get alternate calling time for cases with zero Prodprobability:
zeroProb <- df %>%
  dplyr::filter(Prodprobability == 0)

alternateTimes <- df %>%
  dplyr::filter(propertyID %in% zeroProb$propertyID) %>%
  dplyr::filter(Prodprobability != 0) %>%
  dplyr::arrange(propertyID, desc(Prodprobability))
Best calling hour for cases with zero probability at given time:
# Identifies the zero-prob cases; can be by hour or at a particular instant
zeroProb <- df %>%
  dplyr::filter(Prodprobability == 0)

# Gets the highest calling probability and the closest hour if the probability
# is the same for more than one timeslot
bestTimeForZero <- df %>%
  dplyr::filter(propertyID %in% zeroProb$propertyID) %>%
  dplyr::filter(Prodprobability != 0) %>%
  dplyr::group_by(propertyID) %>%
  dplyr::arrange(desc(Prodprobability), hour) %>%
  dplyr::slice(1L) %>%
  dplyr::ungroup()
Returning number of records as per original df:
zeroProb <- df %>%
  dplyr::filter(Prodprobability == 0) %>%
  dplyr::group_by(propertyID) %>%
  dplyr::summarise(total = n())

bestTimesList <- lapply(1:nrow(zeroProb), function(i) {
  limit <- zeroProb$total[i]
  bestTime <- df %>%
    dplyr::filter(propertyID == zeroProb$propertyID[i]) %>%
    dplyr::arrange(desc(Prodprobability)) %>%
    dplyr::slice(1:limit)
  return(bestTime)
})

bestTimeDf <- bind_rows(bestTimesList)
Note: You can combine the filter statements; I have written them separately to highlight what each step does.
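For instance, the two filters in bestTimeForZero above could be collapsed into a single call (a purely cosmetic change with the same result):

bestTimeForZero <- df %>%
  dplyr::filter(propertyID %in% zeroProb$propertyID, Prodprobability != 0) %>%
  dplyr::group_by(propertyID) %>%
  dplyr::arrange(desc(Prodprobability), hour) %>%
  dplyr::slice(1L) %>%
  dplyr::ungroup()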

Generating additional rows based on a condition within the same data frame

I have a data frame like DF below, which will be imported directly from the database (as a tibble).
library(tidyverse)
library(lubridate)

date_until <- dmy("31.05.2019")
date_val <- dmy("30.06.2018")

DF <- data.frame(date_bal = as.Date(c("2018-04-30", "2018-05-31", "2018-06-30", "2018-05-31", "2018-06-30")),
                 department = c("A", "A", "A", "B", "B"),
                 amount = c(10, 20, 30, 40, 50))

DF <- DF %>%
  as_tibble()

DF
It represents the amount of money spent by each department in a specific month. My task is to project how much money will be spent by each department in the following months, up until a specified date in the future (in this case date_until = 31.05.2019).
I would like to use the tidyverse to generate additional rows for each department, where the first column date_bal would be a sequence of dates from the last one in the "original" DF up until the predefined date_until. Then I would like to add an additional column called "DIFF", which would represent the difference between DATE_BAL and DATE_VAL, where DATE_VAL is also predefined. My final result would look like this:
Final result
I have managed to do this in the following way:
1. First filter the data from DF for department A.
2. Create another DF2 by populating it with a date sequence from min(date_bal) of step 1 up to date_until.
3. Merge the data frames from steps 1 and 2, then add the calculated columns using mutate.
Since I will have to repeat this procedure for many departments, I wonder if it's possible to add rows (create the date sequence) in the existing DF, without creating a second DF and then merging.
Thanks in advance for your help and time.
I add one day to the dates, create a sequence and then rollback to the last day of the previous month.
seq(min(date_val + days(1)), date_until + days(1), by = 'months')[-1] %>%
  rollback() %>%
  tibble(date_bal = .) %>%
  crossing(DF %>% distinct(department)) %>%
  bind_rows(DF %>% select(date_bal, department)) %>%
  left_join(DF) %>%
  arrange(department, date_bal) %>%
  mutate(
    amount = if_else(is.na(amount), 0, amount),
    DIFF = interval(
      rollback(date_val, roll_to_first = TRUE),
      rollback(date_bal, roll_to_first = TRUE)) %/% months(1)
  )
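As a quick sanity check (a sketch, assuming the pipeline above has been assigned to a variable, say DF_out, which the original answer leaves unnamed):

DF_out %>%
  group_by(department) %>%
  summarise(first_month = min(date_bal), last_month = max(date_bal))
# last_month should come out as 2019-05-31 (date_until) for both departments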

Same difference in alternate rows per sub group

I have a data frame where I need to find the difference, but for every alternate row the difference should stay the same, since the things to do are the same, like this:
But I have used this:
things <- data.frame(category = c("A", "B", "A", "B", "A", "B", "A", "B", "A", "B"),
                     things2do = c("ball", "ball", "bat", "bat", "hockey", "hockey", "volley ball", "volley ball", "foos ball", "foos ball"),
                     number = c(12, 5, 4, 1, 0, 2, 2, 0, 0, 2))

things %>%
  mutate(diff = number - lead(number, order_by = things2do))
But it is not helpful, as I am getting this:
Can I get some help here?
library(tidyverse)

things2 <- things %>%
  spread(category, number) %>%
  mutate(diff = B - A) %>%
  gather(category, number, A:B) %>%
  select(category, things2do, number, diff) %>%
  arrange(things2do)
One way is to group the data by things2do and subsequently take an iterated difference.
library(dplyr)

things %>%
  group_by(things2do) %>%
  mutate(diff = diff(number))
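As in the earlier questions, it may be worth finishing with ungroup() so later verbs don't keep operating per things2do group; a sketch of the same pipeline with the values it should produce noted as comments:

things %>%
  group_by(things2do) %>%
  mutate(diff = diff(number)) %>%   # the length-1 difference is recycled to both rows of each pair
  ungroup()
# expected diff per pair (B minus A): ball -7, bat -3, hockey 2, volley ball -2, foos ball 2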
