I have not been able to find a similar question across SO. I am quite new to dplyr in general.
Consider a toy data frame of some sales statistics:
df <- data.frame(
added = c("2020-10-05", "2020-10-30", "2020-11-04", "2020-12-10", "2020-12-14"),
closed = c("", "2020-11-05", "2020-12-10", "", ""),
value = c(100, 200, 300, 400, 500),
stage = c("Quote", "Won", "Lost", "Quote", "Quote")
)
It contains the date a deal was added, when it was closed, the deal value and the current stage of the deal. If a deal is not won or lost, it does not have a closed date. I want to calculate the pipeline value, i.e. the value of deals which are not yet won or lost, for each unique entry in the added and closed column (since these are the only times where the pipeline value can change).
I have gotten it to work through a mix of base R and dplyr. First I create a data frame of dates:
date_df <- as.data.frame(seq(as.Date("2020-10-01"),as.Date(Sys.Date()),1))
colnames(date_df) <- c("date")
My choice of starting date is arbitrary as long as it is before the first date in the added column. Then I evaluate this expression for each entry in date_df through a loop:
library(tidyverse)
pipeline <- c()
for (i in 1:nrow(date_df)) {
pipeline <-
df %>%
filter(
(added <= date_df$date[i] & closed > date_df$date[i] & closed < Sys.Date()) |
(added <= date_df$date[i] & stage != "Won" & stage != "Lost")
) %>%
summarise(pipeline = sum(value))
date_df$pipeline[i] <- pipeline
}
Which correctly gives me:
> date_df
date pipeline
1 2020-10-01 0
2 2020-10-02 0
3 2020-10-03 0
4 2020-10-04 0
5 2020-10-05 100
6 2020-10-06 100
7 2020-10-07 100
8 2020-10-08 100
9 2020-10-09 100
10 2020-10-10 100
11 2020-10-11 100
12 2020-10-12 100
13 2020-10-13 100
14 2020-10-14 100
15 2020-10-15 100
16 2020-10-16 100
17 2020-10-17 100
18 2020-10-18 100
19 2020-10-19 100
20 2020-10-20 100
21 2020-10-21 100
22 2020-10-22 100
23 2020-10-23 100
24 2020-10-24 100
25 2020-10-25 100
26 2020-10-26 100
27 2020-10-27 100
28 2020-10-28 100
29 2020-10-29 100
30 2020-10-30 300
31 2020-10-31 300
32 2020-11-01 300
33 2020-11-02 300
34 2020-11-03 300
35 2020-11-04 600
36 2020-11-05 400
37 2020-11-06 400
38 2020-11-07 400
39 2020-11-08 400
40 2020-11-09 400
41 2020-11-10 400
42 2020-11-11 400
43 2020-11-12 400
44 2020-11-13 400
45 2020-11-14 400
46 2020-11-15 400
47 2020-11-16 400
48 2020-11-17 400
49 2020-11-18 400
50 2020-11-19 400
51 2020-11-20 400
52 2020-11-21 400
53 2020-11-22 400
54 2020-11-23 400
55 2020-11-24 400
56 2020-11-25 400
57 2020-11-26 400
58 2020-11-27 400
59 2020-11-28 400
60 2020-11-29 400
61 2020-11-30 400
62 2020-12-01 400
63 2020-12-02 400
64 2020-12-03 400
65 2020-12-04 400
66 2020-12-05 400
67 2020-12-06 400
68 2020-12-07 400
69 2020-12-08 400
70 2020-12-09 400
71 2020-12-10 500
72 2020-12-11 500
73 2020-12-12 500
74 2020-12-13 500
75 2020-12-14 1000
76 2020-12-15 1000
77 2020-12-16 1000
78 2020-12-17 1000
79 2020-12-18 1000
80 2020-12-19 1000
81 2020-12-20 1000
82 2020-12-21 1000
Basically, I sum the value of open deals for each date. However, this appears to me as a very inefficient approach. The real data set is quite extensive, and to evaluate at each date takes quite some time.
I am really only interested in evaluating the expression for each unique entry in added and closed, as described above. Is there an elegant dplyr solution that accomplishes this, (preferably) without creating a new object? The idea is to later plot the result and send it to a shiny app, so a reduction in computing time could be crucial.
Thanks in advance!
EDIT
To show an example for two dates 2020-11-04 and 2020-11-05:
On 2020-11-04, three deals (entries 1-3 in the data) are neither Lost nor Won yet, hence the pipeline value for that date is 600.
On 2020-11-05, one deal is now closed (the second entry in the data, which was Won), and the pipeline value drops to 400.
So, when a deal is closed, the pipeline value decreases, and when a deal is added, the pipeline value increases.
Values for all dates are added to the original question.
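The rule above (an added deal raises the pipeline, a closed deal lowers it) can be sketched as a cumulative sum over events. This is only a minimal illustration, assuming the dates are already converted with as.Date and that an empty closed date is stored as NA; the names deals, events and delta are made up for the example.

```r
library(dplyr)
library(tidyr)

# Toy data from the question, with dates as Date and open deals as NA
deals <- data.frame(
  added  = as.Date(c("2020-10-05", "2020-10-30", "2020-11-04", "2020-12-10", "2020-12-14")),
  closed = as.Date(c(NA, "2020-11-05", "2020-12-10", NA, NA)),
  value  = c(100, 200, 300, 400, 500)
)

events <- deals %>%
  # one row per (deal, event), where event is "added" or "closed"
  pivot_longer(c(added, closed), names_to = "event", values_to = "date") %>%
  filter(!is.na(date)) %>%
  # an added deal contributes +value, a closed deal contributes -value
  mutate(delta = if_else(event == "added", value, -value)) %>%
  # net change per date, then running total
  group_by(date) %>%
  summarise(delta = sum(delta), .groups = "drop") %>%
  arrange(date) %>%
  mutate(pipeline = cumsum(delta))

events
```

Only the dates on which something changes are evaluated, so the cost scales with the number of deals rather than the number of calendar days.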
This approach will also be useful
result <- dff %>% mutate(id = row_number(),
added = as.Date(added),
closed = as.Date(closed)) %>%
pivot_longer(cols = c("added", "closed"), names_to = "activity", values_to = "dates") %>%
mutate(activity = factor(activity, levels = c("added", "closed"), ordered = T)) %>%
arrange(dates, activity) %>%
mutate(val = cumsum(value*case_when(activity == "added" ~ 1,
activity == "closed" ~ -1,
TRUE ~ 0))) %>%
group_by(dates) %>% summarise(val = min(val))
# A tibble: 7 x 2
dates val
<date> <dbl>
1 2020-10-05 100
2 2020-10-30 300
3 2020-11-04 600
4 2020-11-05 400
5 2020-12-10 500
6 2020-12-14 1000
7 NA 0
The dput of the dff used:
> dput(dff)
structure(list(added = structure(c(18540, 18565, 18570, 18606,
18610), class = "Date"), closed = structure(c(NA, 18571, 18606,
NA, NA), class = "Date"), value = c(100, 200, 300, 400, 500),
stage = c("Quote", "Won", "Lost", "Quote", "Quote")), row.names = c(NA,
-5L), class = "data.frame")
If you want to plot the results, use complete and fill to expand them to daily values:
plot <- result %>% filter(!is.na(dates)) %>%
complete(dates = seq.Date(min(dates), max(dates), by = "day")) %>%
fill(val) %>%
ggplot() +
geom_line(aes(x = dates, y = val))
plot
I don't see why you are using closed < Sys.Date(). Do you have deals with a closed date in the future? If so, this code will account for them:
df %>%
mutate(across(closed:added, lubridate::ymd), cl = closed < Sys.Date()) %>%
pivot_longer(c(closed, added), values_to="date") %>%
filter(!is.na(date)) %>%
group_by(date) %>%
summarise(pipeline=sum((stage=="Quote" | cl) *value)) %>%
mutate(pipeline=cumsum(pipeline))
# A tibble: 7 x 2
date summed
<date> <dbl>
1 2020-10-05 100
2 2020-11-05 300
3 2020-10-30 500
4 2020-12-10 1200
5 2020-11-04 1500
6 2020-12-10 2200
7 2020-12-14 2700
Related
I am working with some monthly data and I would like to convert it to daily data by creating and populating some dummy rows, as the question suggests.
For example, say I have the following data:
date index
2013-04-30 232
2013-05-31 232
2013-06-30 233
Is there an "easy" way, preferably through tidyverse, to convert the above data into daily data, assuming the index stays constant throughout the month? For example, I would like to create another 29 rows for April, ranging from 2013-04-01 to 2013-04-29, with the index of the last day of the month, which would be 232 for April. The same should be applied to the rest of the months (I have more data than just those three months).
Any intuitive suggestions will be greatly appreciated :)
Using complete and fill from tidyr you could do:
dat <- structure(list(
date = structure(c(15825, 15856, 15886), class = "Date"),
index = c(232L, 232L, 233L)
), class = "data.frame", row.names = c(
NA,
-3L
))
library(tidyr)
dat |>
complete(date = seq(as.Date("2013-04-01"), as.Date("2013-06-30"), "day")) |>
fill(index, .direction = "up")
#> # A tibble: 91 × 2
#> date index
#> <date> <int>
#> 1 2013-04-01 232
#> 2 2013-04-02 232
#> 3 2013-04-03 232
#> 4 2013-04-04 232
#> 5 2013-04-05 232
#> 6 2013-04-06 232
#> 7 2013-04-07 232
#> 8 2013-04-08 232
#> 9 2013-04-09 232
#> 10 2013-04-10 232
#> # … with 81 more rows
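If the date range should not be hard-coded, a variant of the same idea derives it from the data. This is a sketch only, assuming every row holds a month-end date; lubridate::floor_date is used to find the first day of the earliest month, and the names monthly and daily are invented for the example.

```r
library(dplyr)
library(tidyr)
library(lubridate)

monthly <- data.frame(
  date  = as.Date(c("2013-04-30", "2013-05-31", "2013-06-30")),
  index = c(232L, 232L, 233L)
)

daily <- monthly %>%
  # expand from the first day of the earliest month to the last observed date
  complete(date = seq(floor_date(min(date), "month"), max(date), by = "day")) %>%
  # each month-end value is copied backwards over the preceding days
  fill(index, .direction = "up")

daily
```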
Hi, I am trying to find the YTD change. The YTD formula is (current month value / last month of previous year) - 1. The result I would like to get is in column y.
For example, Jan-20 is (20/100)-1 and Feb-20 is (120/100)-1. Basically, all 2020 values are divided by Dec-19, which is the last month of 2019.
And Jan-21 should be divided by the Dec-20 value, so it's (100/210)-1.
structure(list(date = structure(c(1575158400, 1577836800, 1580515200,
1583020800, 1585699200, 1588291200, 1590969600, 1593561600, 1596240000,
1598918400, 1601510400, 1604188800, 1606780800, 1609459200, 1612137600
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), x = c(100,
20, 120, 90, 100, 40, 55, 70, 90, 120, 290, 100, 210, 100, 130
), y = c(NA, -0.8, 0.2, -0.1, 0, -0.6, -0.45, -0.3, -0.1, 0.2,
1.9, 0, 1.1, -0.523809523809524, -0.380952380952381)), class = "data.frame", row.names =
c(NA, -15L))
date x y
2019-12-01 100 NA
2020-01-01 20 -0.8000000
2020-02-01 120 0.2000000
2020-03-01 90 -0.1000000
2020-04-01 100 0.0000000
2020-05-01 40 -0.6000000
2020-06-01 55 -0.4500000
2020-07-01 70 -0.3000000
2020-08-01 90 -0.1000000
2020-09-01 120 0.2000000
2020-10-01 290 1.9000000
2020-11-01 100 0.0000000
2020-12-01 210 1.1000000
2021-01-01 100 -0.5238095
2021-02-01 130 -0.3809524
Here's a solution using the tidyverse and lubridate packages. First we create a data frame called last_per_year that stores the last value for each year. Then in the main data frame, we calculate each date's "previous year", and use this to join with last_per_year. With that done, it's simple to perform the YTD calculation.
This technique would make it easy to select multiple columns in last_per_year, join those into the main data set, and compute whatever calculations are needed.
library(tidyverse)
library(lubridate)
last_per_year <- df %>% # YOUR DATA GOES HERE
group_by(year = year(date)) %>% # for each year...
slice_max(order_by = date) %>% # get the last date in each year
select(year, last_value = x) # output columns are "year" and "last_value" (renamed from "x")
year last_value
<dbl> <dbl>
1 2019 100
2 2020 210
3 2021 130
df.new <- df %>%
select(-y) %>% # removing your example output
mutate(
year = year(date),
prev_year = year - 1
) %>%
inner_join(last_per_year, by = c(prev_year = 'year')) %>% # joining with "last_per_year"
mutate(
ytd = x / last_value - 1
)
df.new
date x year prev_year last_value ytd
1 2020-01-01 20 2020 2019 100 -0.8000000
2 2020-02-01 120 2020 2019 100 0.2000000
3 2020-03-01 90 2020 2019 100 -0.1000000
4 2020-04-01 100 2020 2019 100 0.0000000
5 2020-05-01 40 2020 2019 100 -0.6000000
6 2020-06-01 55 2020 2019 100 -0.4500000
7 2020-07-01 70 2020 2019 100 -0.3000000
8 2020-08-01 90 2020 2019 100 -0.1000000
9 2020-09-01 120 2020 2019 100 0.2000000
10 2020-10-01 290 2020 2019 100 1.9000000
11 2020-11-01 100 2020 2019 100 0.0000000
12 2020-12-01 210 2020 2019 100 1.1000000
13 2021-01-01 100 2021 2020 210 -0.5238095
14 2021-02-01 130 2021 2020 210 -0.3809524
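For comparison, the same numbers can be obtained without a join by carrying each December value forward. This is a rough sketch under the assumption that the data is monthly and that every year whose value is needed has a December row; the names sales, base and ytd are made up here.

```r
library(dplyr)
library(tidyr)
library(lubridate)

sales <- data.frame(
  date = seq(as.Date("2019-12-01"), as.Date("2021-02-01"), by = "month"),
  x    = c(100, 20, 120, 90, 100, 40, 55, 70, 90, 120, 290, 100, 210, 100, 130)
)

ytd <- sales %>%
  arrange(date) %>%
  # keep x only on December rows, then shift down one row so that
  # January sees the previous December
  mutate(base = lag(if_else(month(date) == 12, x, NA_real_))) %>%
  # carry that December value through the rest of the year
  fill(base) %>%
  mutate(ytd = x / base - 1)

ytd
```

The first row has no previous December, so its ytd stays NA, which matches the y column in the question.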
I have a dataset that looks similar to:
Date        Total_Nonfarm  Leisure_and_Hospitality
2020-01-01  300            50
2020-02-01  200            40
2020-03-01  100            15
2020-04-01  75             15
2020-05-01  150            10
I need to calculate the monthly losses of both Nonfarm and Leisure and Hospitality jobs (two new columns) based on the total amount lost since 2020-02-01. So this would not affect rows before 2020-02-01, only after.
I've tried to find a way to extract the job counts for the row where Date equals 2020-02-01, but could not figure it out. I thought that if I could store this value (say Feb2020 <- *the correct function to find that value*), I could then use dplyr to add a new column with mutate(Total_jobs_lost = Feb2020 - Total_Nonfarm). But I think this would then also affect rows prior to 2020-02-01.
I would like the output to look as such:
Date        Total_Nonfarm  Leisure_and_Hospitality  Total_Nonfarm_Losses  LH_Losses
2020-01-01  300            50                       NA                    NA
2020-02-01  200            40                       0                     0
2020-03-01  100            15                       -100                  -25
2020-04-01  75             15                       -125                  -25
2020-05-01  150            10                       -50                   -30
Any help would be appreciated. Thanks.
require(tidyverse)
require(lubridate)
(df <- read_csv2(file = "data.csv"))
df$Date <- dmy(df$Date)
df$Total_Nonfarm_Losses <- NA
df$Leisure_and_Hospitality_Losses <- NA
timepoint <- ymd("2020/02/01")
(before <- df %>% filter(Date < timepoint))
(after <- df %>% filter(Date >= timepoint))
(
after
%>% mutate(Feb_Nonfarm = df[df$Date == timepoint, "Total_Nonfarm"] ,
Feb_Leisure = df[df$Date == timepoint, "Leisure_and_Hospitality"],
Total_Nonfarm_Losses = Feb_Nonfarm - Total_Nonfarm,
Leisure_and_Hospitality_Losses = Feb_Leisure - Leisure_and_Hospitality
)
%>% select(1:5)
%>% bind_rows(before, .)
)
And the output:
# A tibble: 5 x 5
Date Total_Nonfarm Leisure_and_Hospi~ Total_Nonfarm_Losses~ Leisure_and_Hospitality_L~
<date> <dbl> <dbl> <dbl> <dbl>
1 2020-01-01 300 50 NA NA
2 2020-02-01 200 40 0 0
3 2020-03-01 100 15 100 25
4 2020-04-01 75 15 125 25
5 2020-05-01 150 10 50 30
library(tidyverse)
data <- read.delim("clipboard")
# use a real numeric NA; the string "NA" would turn both columns into character
data$Total_Nonfarm_Losses <- NA_real_
data$LH_Losses <- NA_real_
Feb2020 <- filter(data,Date == "2020-02-01")$Total_Nonfarm
Feb2020b <- filter(data,Date == "2020-02-01")$Leisure_and_Hospitality
data[2:5,] <- data[2:5,] %>%
mutate(Total_Nonfarm_Losses = Total_Nonfarm- Feb2020,
LH_Losses = Leisure_and_Hospitality - Feb2020b)
Date Total_Nonfarm Leisure_and_Hospitality Total_Nonfarm_Losses LH_Losses
1 2020-01-01 300 50 NA NA
2 2020-02-01 200 40 0 0
3 2020-03-01 100 15 -100 -25
4 2020-04-01 75 15 -125 -25
5 2020-05-01 150 10 -50 -30
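The same result can also be sketched in a single mutate, without splitting the data or hard-coding row numbers. This assumes Date is already a Date column and that exactly one row matches the baseline month; the names jobs, baseline and jobs_losses are invented for the example.

```r
library(dplyr)

jobs <- data.frame(
  Date = as.Date(c("2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01", "2020-05-01")),
  Total_Nonfarm = c(300, 200, 100, 75, 150),
  Leisure_and_Hospitality = c(50, 40, 15, 15, 10)
)

baseline <- as.Date("2020-02-01")

jobs_losses <- jobs %>%
  mutate(
    # rows before the baseline stay NA; later rows are compared with February
    Total_Nonfarm_Losses = if_else(
      Date >= baseline,
      Total_Nonfarm - Total_Nonfarm[Date == baseline],
      NA_real_
    ),
    LH_Losses = if_else(
      Date >= baseline,
      Leisure_and_Hospitality - Leisure_and_Hospitality[Date == baseline],
      NA_real_
    )
  )

jobs_losses
```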
A data frame and some variables:
library(tidyverse)
library(lubridate)
budget_2020_q4 <- 1000000
budget_2021_q1 <- 2000000
budget_2021_q2 <- 3000000
budget_2021_q3 <- 3000000
budget_2021_q4 <- 2000000
calendar <- data.frame(
cohort = seq('2020-10-01' %>% ymd, '2021-12-31' %>% ymd, by = '1 days')) %>%
mutate(Quarter = quarter(cohort, with_year = T))
I now have a data frame showing dates and the quarter that those dates are in:
calendar %>% head
cohort Quarter
1 2020-10-01 2020.4
2 2020-10-02 2020.4
3 2020-10-03 2020.4
4 2020-10-04 2020.4
5 2020-10-05 2020.4
6 2020-10-06 2020.4
I also know the frequency of each quarter:
calendar$Quarter %>% table
.
2020.4 2021.1 2021.2 2021.3 2021.4
92 90 91 92 92
I would like to mutate a new column 'daily_budget' that takes the budget for the quarter and divides it over the frequency of dates in that quarter.
For example, the budget for Q4 2020 is 1000000 and there are 92 days in Q4, so the daily budget is 1000000/92 = 10869.57.
Can I somehow integrate this calculation into my dplyr pipeline of operations after mutate(Quarter = quarter(cohort, with_year = T))?
First, let's put the budgets in a data frame:
budgets <- c(budget_2020_q4 = 1000000,
budget_2021_q1 = 2000000,
budget_2021_q2 = 3000000,
budget_2021_q3 = 3000000,
budget_2021_q4 = 2000000) %>%
enframe(name = "Quarter", value = "budget") %>%
mutate(Quarter = as.numeric(str_replace(str_remove(Quarter, "budget_"), "_q", ".")))
Then, it is a matter of counting (tidyverse's alternative to table) the number of rows per Quarter, joining the budget and dividing the two:
calendar %>%
add_count(Quarter) %>%
left_join(budgets, by = "Quarter") %>%
mutate(budget_by_day = budget / n)
Which gives
cohort Quarter n budget budget_by_day
1 2020-10-01 2020.4 92 1e+06 10869.57
2 2020-10-02 2020.4 92 1e+06 10869.57
3 2020-10-03 2020.4 92 1e+06 10869.57
4 2020-10-04 2020.4 92 1e+06 10869.57
5 2020-10-05 2020.4 92 1e+06 10869.57
6 2020-10-06 2020.4 92 1e+06 10869.57
7 2020-10-07 2020.4 92 1e+06 10869.57
8 2020-10-08 2020.4 92 1e+06 10869.57
9 2020-10-09 2020.4 92 1e+06 10869.57
10 2020-10-10 2020.4 92 1e+06 10869.57
...
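The join above matches on a floating-point Quarter such as 2020.4, which relies on both sides producing exactly the same number. A possible variation, shown here only as a sketch, keys the budgets on separate integer year and quarter columns and divides by n() within each group. The layout of the budgets table is an assumption made for the example.

```r
library(dplyr)
library(lubridate)

# budgets keyed by integer year and quarter (hypothetical layout)
budgets <- data.frame(
  year    = c(2020L, 2021L, 2021L, 2021L, 2021L),
  quarter = c(4L, 1L, 2L, 3L, 4L),
  budget  = c(1000000, 2000000, 3000000, 3000000, 2000000)
)

calendar <- data.frame(
  cohort = seq(ymd("2020-10-01"), ymd("2021-12-31"), by = "1 day")
) %>%
  mutate(
    year    = as.integer(year(cohort)),
    quarter = as.integer(quarter(cohort))
  ) %>%
  left_join(budgets, by = c("year", "quarter")) %>%
  # n() is the number of calendar days in each quarter
  group_by(year, quarter) %>%
  mutate(daily_budget = budget / n()) %>%
  ungroup()

head(calendar)
```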
I'm trying to populate "FinalDate" based on "ExpectedDate" and "ObservedDate".
The rules are: for each group, if observed date is greater than previous expected date and less than the next expected date then final date is equal to observed date, otherwise final date is equal to expected date.
How can I modify the code below to make sure that:
FinalDate is filled in by Group
Iteration numbers don't skip any rows
set.seed(2)
dat<-data.frame(Group=sample(LETTERS[1:10], 100, replace=TRUE),
Date=sample(seq(as.Date('2013/01/01'), as.Date('2020/01/01'), by="day"), 100))%>%
mutate(ExpectedDate=Date+sample(10:200, 100, replace=TRUE),
ObservedDate=Date+sample(10:200, 100, replace=TRUE))%>%
group_by(Group)%>%
arrange(Date)%>%
mutate(n=row_number())%>%arrange(Group)%>%ungroup()%>%
as.data.frame()
#generate some missing values in "ObservedDate"
dat[sample(nrow(dat),20), "ObservedDate"]<-NA
dat$FinalDate<-NA
for (i in 1:nrow(dat)){
dat[i, "FinalDate"]<-if_else(!is.na(dat$"ObservedDate")[i] &&
dat[i, "ObservedDate"] > dat[i-1, "ExpectedDate"] &&
dat[i, "ObservedDate"] < dat[i+1, "ExpectedDate"],
dat[i, "ObservedDate"],
dat[i,"ExpectedDate"])
}
dat$FinalDate<-as.Date(dat$FinalDate) # convert numeric to Date format
e.g. in output below:
at i=90, the code looks for previous ExpectedDate within letter I
we want it to look for ExpectedDate only within letter J. If there is no previous expected date for a group and ObservedDate is greater than ExpectedDate but less than the next ExpectedDate then FinalDate should be filled with ExpectedDate.
at i=100, the code generates NA because there is no next observation available
we want this value to be filled in such that for last observation in each group, FinalDate=ObservedDate if ObservedDate is greater than this last ExpectedDate within group, else ExpectedDate.
Group Date ExpectedDate ObservedDate n FinalDate
88 I 2015-09-07 2015-12-05 <NA> 7 2015-12-05
89 I 2018-08-02 2018-11-01 2018-08-13 8 2018-11-01
90 J 2013-07-24 2013-08-30 2013-08-12 1 2013-08-30
91 J 2013-11-22 2014-01-02 2014-04-05 2 2014-04-05
92 J 2014-11-03 2015-03-23 2015-05-10 3 2015-05-10
93 J 2015-08-30 2015-12-09 2016-02-04 4 2016-02-04
94 J 2016-04-18 2016-09-03 <NA> 5 2016-09-03
95 J 2016-10-10 2017-01-29 2017-04-14 6 2017-04-14
96 J 2017-02-14 2017-07-05 <NA> 7 2017-07-05
97 J 2017-04-21 2017-10-01 2017-08-26 8 2017-08-26
98 J 2017-10-01 2018-01-27 2018-02-28 9 2018-02-28
99 J 2018-08-03 2019-01-31 2018-10-20 10 2018-10-20
100 J 2019-04-25 2019-06-23 2019-08-16 11 <NA>
We can drop the for loop and use group_by, lag and lead from dplyr here:
library(dplyr)
dat %>%
group_by(Group) %>%
mutate(FinalDate = if_else(ObservedDate > lag(ExpectedDate) &
ObservedDate < lead(ExpectedDate), ObservedDate, ExpectedDate))
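We can also do this with data.table::between. Note that between includes its bounds by default; pass incbounds = FALSE for strict inequalities.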
dat %>%
group_by(Group) %>%
mutate(FinalDate = if_else(data.table::between(ObservedDate,
lag(ExpectedDate), lead(ExpectedDate)), ObservedDate, ExpectedDate))
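With plain lag and lead, a missing ObservedDate, or the first or last row of a group, yields NA in the comparison and therefore in FinalDate. One way to avoid that is sketched below. It rests on assumptions that may not match the question's exact edge-case rules: a missing neighbour is treated as "no bound", a missing ObservedDate falls back to ExpectedDate, and rows are already ordered within each group. The names visits, after_prev, before_next and use_observed are made up for the example.

```r
library(dplyr)

visits <- data.frame(
  Group        = c("A", "A", "A", "B", "B"),
  ExpectedDate = as.Date(c("2020-01-10", "2020-03-10", "2020-06-10", "2020-02-01", "2020-05-01")),
  ObservedDate = as.Date(c("2020-01-15", NA, "2020-02-01", "2020-06-01", "2020-05-20"))
)

result <- visits %>%
  group_by(Group) %>%
  mutate(
    # assumption: no previous/next expected date means that side is unbounded
    after_prev   = is.na(lag(ExpectedDate)) | ObservedDate > lag(ExpectedDate),
    before_next  = is.na(lead(ExpectedDate)) | ObservedDate < lead(ExpectedDate),
    # FALSE & NA is FALSE, so a missing ObservedDate never selects itself
    use_observed = !is.na(ObservedDate) & after_prev & before_next,
    FinalDate    = if_else(use_observed, ObservedDate, ExpectedDate)
  ) %>%
  ungroup()

result
```

If the first or last row of a group should follow a different rule, only the after_prev and before_next lines need to change.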