Generating additional rows based on a condition within the same data frame in R

I have a data frame like the DF below, which will be imported directly from the database (as a tibble).
library(tidyverse)
library(lubridate)
date_until <- dmy("31.05.2019")
date_val <- dmy("30.06.2018")
DF <- data.frame(
  date_bal = as.Date(c("2018-04-30", "2018-05-31", "2018-06-30", "2018-05-31", "2018-06-30")),
  department = c("A", "A", "A", "B", "B"),
  amount = c(10, 20, 30, 40, 50)
)
DF <- DF %>%
  as_tibble()
DF
It represents the amount of money spent by each department in a specific month. My task is to project how much money will be spent by each department in the following months, up to a specified date in the future (in this case date_until = 31.05.2019).
I would like to use the tidyverse to generate additional rows for each department, where the first column date_bal would be a sequence of dates from the last one in the "original" DF up until the predefined date_until. Then I would like to add a column called DIFF, which would represent the difference between date_bal and the predefined date_val. My final result would look like this:
[screenshot of the expected final table]
I have managed to do this in the following way:
1. First filter data from DF for department A.
2. Create a second data frame DF2, populated with a date sequence from min(date_bal) in step 1 up to date_until.
3. Merge the data frames from steps 1 and 2, then add the calculated columns using mutate().
Since I will have to repeat this procedure for many departments, I wonder if it is possible to add the rows (create the date sequence) in the existing DF, without creating a second data frame and then merging.
Thanks in advance for your help and time.

I add one day to date_val, create a monthly sequence up to date_until, and then roll each date back to the last day of the previous month.
seq(date_val + days(1), date_until + days(1), by = "months")[-1] %>%
  rollback() %>%                                    # last day of the previous month
  tibble(date_bal = .) %>%
  crossing(DF %>% distinct(department)) %>%         # one projected row per department and month
  bind_rows(DF %>% select(date_bal, department)) %>%
  left_join(DF, by = c("date_bal", "department")) %>%
  arrange(department, date_bal) %>%
  mutate(
    amount = if_else(is.na(amount), 0, amount),     # projected months have no spend yet
    DIFF = interval(
      rollback(date_val, roll_to_first = TRUE),
      rollback(date_bal, roll_to_first = TRUE)) %/% months(1)  # whole months from date_val to date_bal
  )
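For reference, here is a minimal sketch of the same expansion done directly on DF with tidyr::complete(), so no second data frame or merge is needed. It assumes the same predefined date_val and date_until as above, and is a sketch rather than a tested drop-in replacement:
DF %>%
  group_by(department) %>%
  complete(date_bal = rollback(seq(min(date_bal) + days(1),
                                   date_until + days(1), by = "months")),
           fill = list(amount = 0)) %>%   # newly created rows get amount 0
  ungroup() %>%
  mutate(DIFF = interval(rollback(date_val, roll_to_first = TRUE),
                         rollback(date_bal, roll_to_first = TRUE)) %/% months(1))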

Related

Subtracting dates in the same row by a factor in R

I have the following data frame:
DF <- data.frame(stringsAsFactors = TRUE,
                 Sample = c(rep("s1", 4), rep("s2", 4)),
                 date = c("21/07/2020", "24/07/2020", "25/07/2020", "27/07/2020",
                          "03/08/2020", "06/08/2020", "09/08/2020", "10/08/2020"))
First I want to obtain the number of days between consecutive dates within each level of the factor "Sample", so the output would look like this:
DF_2 <- data.frame(stringsAsFactors = TRUE,
                   Sample = c(rep("s1", 4), rep("s2", 4)),
                   date = c("21/07/2020", "24/07/2020", "25/07/2020", "27/07/2020",
                            "03/08/2020", "06/08/2020", "09/08/2020", "10/08/2020"),
                   days = c(NA, 3, 1, 2, NA, 3, 3, 1))
Here the variable "days" is my outcome variable.
Afterwards I want to sum all those "days" by factor, but that part is easy; I will do it like this:
df_3 <- aggregate(days ~ Sample, DF_2, sum)
I would much appreciate it if someone could help me get the first step right, i.e. obtain DF_2.
We can use diff() on the 'date' column after converting it to Date class:
library(dplyr)
library(lubridate)
DF1 <- DF %>%
  mutate(date = dmy(date)) %>%          # convert to Date class
  group_by(Sample) %>%
  mutate(days = c(NA, diff(date))) %>%  # days elapsed since the previous row
  ungroup()
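For the follow-up aggregation from the question, the dplyr analogue of aggregate(days ~ Sample, DF_2, sum) would then be (total_days is just an illustrative name):
DF1 %>%
  group_by(Sample) %>%
  summarise(total_days = sum(days, na.rm = TRUE))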

Using the runner package to summarise groups

I have a table of house prices and sale dates. I want to calculate the rolling median price over a time window of 365 days using the runner package. I only want one median price per date.
My problem is when I try the below code, I get more than one median price for a date if that date appears more than once. This isn't what I expected to occur. I thought there'd be one result for each day if I used group_by/summarise.
library(runner)
library(tidyverse)
library(lubridate)
startDate <- as_date("2018-01-01")
endDate <- as_date("2020-01-01")
# Create data
soldData <- tibble(
  price = round(rnorm(100, mean = 500000, sd = 100000), -3),
  date = sample(seq.Date(startDate, endDate, by = "days"), 100, replace = TRUE))
# Fill in the missing dates between startDate and endDate
soldData <- bind_rows(soldData,
                      anti_join(tibble(date = seq.Date(startDate, endDate, by = "day")),
                                soldData, by = "date")) %>%
  arrange(date)
# Find the duplicated dates
duplicatedDates <- soldData[duplicated(soldData$date),]$date
# I thought using group_by/summarise would return one medianPrice per date
results <- soldData %>%
  group_by(date) %>%
  summarise(medianPrice = runner(
    price,
    k = "365 days",
    idx = date,
    f = function(x) median(x, na.rm = TRUE)))
# These are the problem rows.
duplicatedResults <- results %>%
  filter(date %in% duplicatedDates)
Any idea where I'm going wrong?
From dplyr 1.0.0, summarise() can return multiple rows per group, which is why you get more than one median for a duplicated date.
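As a minimal illustration (toy names, just for the demo), a length-2 summary expression now yields two rows for a single group; newer dplyr versions point you to reframe() for this behaviour:
tibble(g = c(1, 1), x = 1:2) %>%
  group_by(g) %>%
  summarise(x = x)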
First you need to deal with the duplicate dates already present in your data. What do you want to do with dates that occur more than once? One option is to take their median (or mean) first.
library(dplyr)
library(runner)
soldData %>%
  group_by(date) %>%
  summarise(price = median(price, na.rm = TRUE)) -> df
So now in df we only have one value for each date. You can now apply the runner function.
df %>%
  mutate(medianPrice = runner(price,
                              k = "365 days",
                              idx = date,
                              f = function(x) median(x, na.rm = TRUE)))
There is also zoo::rollmedianr, which helps with calculating rolling medians.
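As a rough sketch of that route, assuming df has exactly one row per consecutive day (as built above) so that a 365-row window matches a 365-day window; rollapplyr() is used here rather than rollmedianr() so the NA prices on no-sale days can be dropped with na.rm = TRUE:
library(zoo)
df %>%
  mutate(medianPrice = rollapplyr(price, width = 365, FUN = median,
                                  na.rm = TRUE, partial = TRUE))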

Need help ungrouping data frame

I have a dataset that I grouped in order to filter, and now I am trying to expand back out the rows I originally had to consolidate for that filtering. The data is a set of sales points by sales date: there is a row for each sales point, but many different sales points for the same type of item. So I grouped by item in order to get summary information like total sales, average sales price, etc. Now, after filtering and finding the items with the right sales frequency, I need to ungroup the data in order to see all of the individual sales points. I have tried using ungroup(), but either it is not working or I am not using it right.
top25000 <- read_csv("index_sales_export.csv")
blue_chip_index <- top25000 %>%
  select(graded_title, date, price, vcp_card_grade_id) %>%
  filter(date >= as.Date("2018-07-08")) %>%
  group_by(graded_title) %>%
  summarise(Market_Cap = sum(price),
            Average_Price = mean(price),
            VCP_Card_Grade_ID = mean(vcp_card_grade_id),
            count = n()) %>%
  filter(Market_Cap >= 10000 & count >= 25) %>%
  rename(Total_Number_of_Sales = count)
blue_chip_index
If I understand what you want, it sounds like you should just start with the original data frame again, then filter on the items that you want.
Ungrouping the new data frame won’t allow you to unsummarize it.
Does this achieve what you want?
library(tidyverse)
top25000 <- read_csv("index_sales_export.csv")
blue_chip_index <- top25000 %>%
  select(graded_title, date, price, vcp_card_grade_id) %>%
  filter(date >= as.Date("2018-07-08")) %>%
  group_by(graded_title) %>%
  summarise(Market_Cap = sum(price),
            Average_Price = mean(price),
            VCP_Card_Grade_ID = mean(vcp_card_grade_id),
            count = n()) %>%
  filter(Market_Cap >= 10000 & count >= 25) %>%
  rename(Total_Number_of_Sales = count)

top25000 %>%
  filter(graded_title %in% blue_chip_index$graded_title)
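An equivalent last step, if you prefer a join, is semi_join(), which keeps the rows of top25000 that have a match in blue_chip_index without adding any columns:
top25000 %>%
  semi_join(blue_chip_index, by = "graded_title")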

How to check if an id that comes into the data on a particular date stays until an exit date

I have a data set that looks something like the one below. Basically, I want to check that if a particular id is present at the beginning of the year (in this case Jan 1, 2003), it is present every day until the end of the year (Dec 31, 2003), and then start the checking process over again with the start of the next year, since people might change from year to year but should not change within a year. If on a certain day an id is not present, I would like to know which day and which id.
I first started with a for loop comparing each pair of consecutive days, but this is super inefficient, since my data set spans roughly 50 years and will grow later on with new data.
dates <- rep(seq(as.Date("2003/01/01"), as.Date("2004/12/31"), "days"), each = 3)
id <- rep(1:3, times = length(unique(dates)))
df <- data.frame(dates = dates, id = id)
Edit: The above chunk has all the dates in it, but if I delete, for example, id = 1 on the second day, the code should tell me it is missing, so the count shouldn't be the same. I added the line that deletes id = 1 on the second day below.
df <- df[-4, ]
The code below will build the same data set but delete id = 1 for Jan 2, 2003 and Jan 3, 2003. I am trying to get something that returns the missing id and the date.
dates <- rep(seq(as.Date("2003/01/01"), as.Date("2004/12/31"), "days"), each = 3)
id <- rep(1:3, times = length(unique(dates)))
df <- data.frame(dates = dates, id = id)
df <- df[-4, ]
df <- df[-6, ]
This code chunk will count the number of times a person appears in each year. If the answer is 365 (or 366 in leap years), the person was there every day of the year.
library(dplyr)
library(tidyr)
dates <- rep(seq(as.Date("2003/01/01"), as.Date("2004/12/31"), "days"),each = 3)
id <- rep(1:3,times = length(unique(dates)))
df <- data.frame( dates = dates,id = id)
dfx <- df %>%
  mutate(yrs = lubridate::year(dates)) %>%
  group_by(id, dates) %>%
  filter(row_number() == 1) %>%   # keep one row per id/date in case of duplicates
  group_by(id, yrs) %>%
  tally()
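With the complete df built above, tally() returns n = 365 for 2003 and n = 366 for the 2004 leap year for every id, so any smaller count flags a year with missing days.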
# remove two rows to simulate missing ids
dfa <- df[c(-4, -6), ]
Then, in order to find the dates of the missing values, add an indicator column to the data set and fill in the missing dates by id with complete(). After this the val column will be NA for the filled-in rows; filter on it to get the dates where an id went missing.
dfx <- dfa %>%
  mutate(val = 1) %>%   # indicator: 1 for rows that exist
  complete(nesting(id),
           dates = seq(min(dates), max(dates), by = "day")) %>%
  filter(is.na(val))    # rows created by complete() are the missing ones
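With dfa above, the filtered result contains id 1 and id 3 on 2003-01-02, exactly the two rows that were removed.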

dplyr mutate with a within-group selection

I am trying to use a within-group selection inside a mutate() call in dplyr after grouping.
df <- data.frame(city = c(rep(1, 4), rep(2, 4)),
                 year = rep(1, 8),
                 victory = rep(c(1, 0, 0, 0), 2),
                 affiliation = c("a", "b", "c", "d", "e", "f", "g", "h"))
Non-working code:
df %>%
  group_by(.dots = c("city", "year")) %>%
  mutate(group_affiliation = affiliation(victory == 1))
Expectation:
group_affiliation = c(rep("a", 4), rep("e", 4))
For each city/year group I am trying to get the affiliation value of the row where victory == 1 and then spread it to the whole group with mutate.
P.S. I would have done it in two parts, re-merging the groups, but my computer cannot handle the vector size.
library(dplyr)
df %>%
  group_by(city, year) %>%
  mutate(group_affiliation = affiliation[victory == 1])
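One caveat: affiliation[victory == 1] assumes exactly one victory == 1 row per group; with none, mutate() fails on a size-0 result. If that can happen in the real data, indexing with [1] falls back to NA instead (a defensive sketch, not required for the example above):
df %>%
  group_by(city, year) %>%
  mutate(group_affiliation = affiliation[victory == 1][1]) %>%
  ungroup()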
