Hi, I am trying to find the YTD change. The YTD formula is (current month value / last month of the previous year) - 1. The result I would like to get is shown in column y.
For example, Jan-20 is (20/100) - 1 and Feb-20 is (120/100) - 1; all 2020 values are divided by the Dec-19 value, since December is the last month of 2019.
Likewise, Jan-21 should be divided by the Dec-20 value, so it is (100/210) - 1.
structure(list(date = structure(c(1575158400, 1577836800, 1580515200,
1583020800, 1585699200, 1588291200, 1590969600, 1593561600, 1596240000,
1598918400, 1601510400, 1604188800, 1606780800, 1609459200, 1612137600
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), x = c(100,
20, 120, 90, 100, 40, 55, 70, 90, 120, 290, 100, 210, 100, 130
), y = c(NA, -0.8, 0.2, -0.1, 0, -0.6, -0.45, -0.3, -0.1, 0.2,
1.9, 0, 1.1, -0.523809523809524, -0.380952380952381)), class = "data.frame", row.names =
c(NA, -15L))
date x y
2019-12-01 100 NA
2020-01-01 20 -0.8000000
2020-02-01 120 0.2000000
2020-03-01 90 -0.1000000
2020-04-01 100 0.0000000
2020-05-01 40 -0.6000000
2020-06-01 55 -0.4500000
2020-07-01 70 -0.3000000
2020-08-01 90 -0.1000000
2020-09-01 120 0.2000000
2020-10-01 290 1.9000000
2020-11-01 100 0.0000000
2020-12-01 210 1.1000000
2021-01-01 100 -0.5238095
2021-02-01 130 -0.3809524
Here's a solution using the tidyverse and lubridate packages. First we create a data frame called last_per_year that stores the last value for each year. Then in the main data frame, we calculate each date's "previous year", and use this to join with last_per_year. With that done, it's simple to perform the YTD calculation.
This technique would make it easy to select multiple columns in last_per_year, join those into the main data set, and compute whatever calculations are needed; see the sketch after the output below.
library(tidyverse)
library(lubridate)
last_per_year <- df %>% # YOUR DATA GOES HERE
group_by(year = year(date)) %>% # for each year...
slice_max(order_by = date) %>% # get the last date in each year
select(year, last_value = x) # output columns are "year" and "last_value" (renamed from "x")
year last_value
<dbl> <dbl>
1 2019 100
2 2020 210
3 2021 130
df.new <- df %>%
select(-y) %>% # removing your example output
mutate(
year = year(date),
prev_year = year - 1
) %>%
inner_join(last_per_year, by = c(prev_year = 'year')) %>% # joining with "last_per_year"
mutate(
ytd = x / last_value - 1
)
df.new
date x year prev_year last_value ytd
1 2020-01-01 20 2020 2019 100 -0.8000000
2 2020-02-01 120 2020 2019 100 0.2000000
3 2020-03-01 90 2020 2019 100 -0.1000000
4 2020-04-01 100 2020 2019 100 0.0000000
5 2020-05-01 40 2020 2019 100 -0.6000000
6 2020-06-01 55 2020 2019 100 -0.4500000
7 2020-07-01 70 2020 2019 100 -0.3000000
8 2020-08-01 90 2020 2019 100 -0.1000000
9 2020-09-01 120 2020 2019 100 0.2000000
10 2020-10-01 290 2020 2019 100 1.9000000
11 2020-11-01 100 2020 2019 100 0.0000000
12 2020-12-01 210 2020 2019 100 1.1000000
13 2021-01-01 100 2021 2020 210 -0.5238095
14 2021-02-01 130 2021 2020 210 -0.3809524
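As a sketch of that multi-column extension: assuming df also had a second measure column z (hypothetical; it is not in the data above), the same pattern could look like this:
last_per_year <- df %>%
  group_by(year = year(date)) %>%
  slice_max(order_by = date) %>%
  select(year, last_x = x, last_z = z)   # "z" is a hypothetical extra column

df %>%
  select(-y) %>%
  mutate(prev_year = year(date) - 1) %>%
  inner_join(last_per_year, by = c(prev_year = "year")) %>%
  mutate(
    ytd_x = x / last_x - 1,   # YTD change for x
    ytd_z = z / last_z - 1    # YTD change for the hypothetical z
  )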
patient_id dt_diag_init enroll
<int64> <date> <date>
1 10401 2018-01-04 2020-09-30
2 60701 2019-05-31 2019-09-30
3 343702 2018-12-05 2020-09-30
4 472202 2019-12-30 2020-09-30
5 489502 2019-09-17 2019-11-30
6 557401 2019-10-15 2020-09-30
7 857901 2018-01-02 2020-09-30
8 874201 2018-01-01 2020-09-30
9 1309102 2019-03-11 2020-09-30
10 1317601 2018-08-14 2020-09-30
I am trying to tally the number of patients per quarter based on the date range (dt_diag_init to enroll). Patient 10401 runs from 2018-01-04 to 2020-09-30, so that patient would be counted in Q1'18, Q2'18, ..., through Q3'20, since their date range overlaps those quarters. I am trying to get an output table with just the quarter and the count of patients per quarter.
Sample output table
Qtr year total
<chr> <dbl> <dbl>
1 Q1'18 2018 485
2 Q2'18 2018 516
3 Q3'18 2018 560
This is what I tried to create the table with the quarter column before doing the tally, but I am getting errors:
df_1 <- df %>%
  mutate(quarter = map2(
    as.numeric(dt_diag_init),
    as.numeric(enroll),
    ~ format(seq(.x, .y, by = "quarter"), "Q%q'%y")
  ))
Sample data:
df <- structure(list(patient_id = structure(c(5.13877678239481e-320,
2.99902787682095e-319, 1.69811350606928e-318, 2.33298786097528e-318,
2.41846121770582e-318, 2.75392685057557e-318, 4.23859411632851e-318,
4.31912681660064e-318, 6.46782325102068e-318, 6.50981389026072e-318
), class = "integer64"), dt_diag_init = structure(c(17535, 18047,
17870, 18260, 18156, 18184, 17533, 17532, 17966, 17757), class = "Date"),
enroll = structure(c(18535, 18169, 18535, 18535, 18230, 18535,
18535, 18535, 18535, 18535), class = "Date")), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
Perhaps this helps. (The original attempt errors because as.numeric() strips the Date class, so seq() cannot interpret by = 'quarter'; also, the %q format code is a zoo extension, not part of base format().)
library(dplyr)
library(zoo)
library(tidyr)
library(purrr)
library(lubridate)
df %>%
  mutate(quarter = map2(dt_diag_init, enroll,
                        ~ as.yearqtr(seq(.x, .y, by = 'quarter')))) %>%
  unnest(quarter) %>%
  mutate(year = year(quarter)) %>%
  count(quarter, year, name = 'total')
-output
# A tibble: 11 × 3
quarter year total
<yearqtr> <dbl> <int>
1 2018 Q1 2018 3
2 2018 Q2 2018 3
3 2018 Q3 2018 4
4 2018 Q4 2018 5
5 2019 Q1 2019 6
6 2019 Q2 2019 7
7 2019 Q3 2019 8
8 2019 Q4 2019 8
9 2020 Q1 2020 8
10 2020 Q2 2020 8
11 2020 Q3 2020 8
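If you want the Qtr labels in the Q1'18 form from the sample output, a sketch (relying on zoo's format method for yearqtr objects) is to format the quarter column after counting, so the chronological order is preserved:
df %>%
  mutate(quarter = map2(dt_diag_init, enroll,
                        ~ as.yearqtr(seq(.x, .y, by = 'quarter')))) %>%
  unnest(quarter) %>%
  mutate(year = year(quarter)) %>%
  count(quarter, year, name = 'total') %>%
  mutate(Qtr = format(quarter, "Q%q'%y")) %>%  # zoo's yearqtr format codes
  select(Qtr, year, total)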
I have a dataframe like so:
Month CumulativeSum
2019-02-01 40
2019-03-01 70
2019-04-01 80
2019-07-01 100
2019-08-01 120
The problem is that nothing happened in May and June, so there is no data for those months, and plotting this as a bar chart leaves empty space on the x-axis.
Is there some way to "fill" the missing spots like so, using the last known value?
Month CumulativeSum
2019-02-01 40
2019-03-01 70
2019-04-01 80
**2019-05-01 80** <--
**2019-06-01 80** <--
2019-07-01 100
2019-08-01 120
We can use complete
library(dplyr)
library(tidyr)
df1 %>%
complete(Month = seq(min(Month), max(Month), by = '1 month')) %>%
fill(CumulativeSum)
-output
# A tibble: 7 x 2
# Month CumulativeSum
# <date> <int>
#1 2019-02-01 40
#2 2019-03-01 70
#3 2019-04-01 80
#4 2019-05-01 80
#5 2019-06-01 80
#6 2019-07-01 100
#7 2019-08-01 120
data
df1 <- structure(list(Month = structure(c(17928, 17956, 17987, 18078,
18109), class = "Date"), CumulativeSum = c(40L, 70L, 80L, 100L,
120L)), row.names = c(NA, -5L), class = "data.frame")
Here is a base R option using cummax
transform(
  data.frame(
    Month = seq(min(df1$Month), max(df1$Month), by = "1 month"),
    CumulativeSum = -Inf
  ),
  CumulativeSum = cummax(replace(CumulativeSum, Month %in% df1$Month, df1$CumulativeSum))
)
which gives
Month CumulativeSum
1 2019-02-01 40
2 2019-03-01 70
3 2019-04-01 80
4 2019-05-01 80
5 2019-06-01 80
6 2019-07-01 100
7 2019-08-01 120
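Note that cummax works here only because a cumulative sum never decreases. For a column where that does not hold, a last-observation-carried-forward sketch using zoo::na.locf (my suggestion, not part of the original answer) would be safer:
library(zoo)
full <- data.frame(Month = seq(min(df1$Month), max(df1$Month), by = "1 month"))
full <- merge(full, df1, by = "Month", all.x = TRUE)  # missing months become NA
full$CumulativeSum <- na.locf(full$CumulativeSum)     # carry the last value forward
full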
I have not been able to find a similar question across SO. I am quite new to dplyr in general.
Consider a toy data frame of some sales statistics
df <- data.frame(
added = c("2020-10-05", "2020-10-30", "2020-11-04", "2020-12-10", "2020-12-14"),
closed = c("", "2020-11-05", "2020-12-10", "", ""),
value = c(100, 200, 300, 400, 500),
stage = c("Quote", "Won", "Lost", "Quote", "Quote")
)
It contains the date a deal was added, when it was closed, the deal value and the current stage of the deal. If a deal is not won or lost, it does not have a closed date. I want to calculate the pipeline value, i.e. the value of deals which are not yet won or lost, for each unique entry in the added and closed columns (since these are the only dates on which the pipeline value can change).
I have gotten it to work through a mix of base R and dplyr. First I create a data frame of dates:
date_df <- as.data.frame(seq(as.Date("2020-10-01"), Sys.Date(), 1))
colnames(date_df) <- c("date")
My choice of starting date is arbitrary as long as it is before the first date in the added column. Then I evaluate this expression for each entry in date_df through a loop:
library(tidyverse)
pipeline <- c()
for (i in 1:nrow(date_df)) {
  pipeline <-
    df %>%
    filter(
      (added <= date_df$date[i] & closed > date_df$date[i] & closed < Sys.Date()) |
        (added <= date_df$date[i] & stage != "Won" & stage != "Lost")
    ) %>%
    summarise(pipeline = sum(value))
  date_df$pipeline[i] <- pipeline$pipeline  # extract the scalar, not the 1x1 data frame
}
Which correctly gives me:
> date_df
date pipeline
1 2020-10-01 0
2 2020-10-02 0
3 2020-10-03 0
4 2020-10-04 0
5 2020-10-05 100
6 2020-10-06 100
7 2020-10-07 100
8 2020-10-08 100
9 2020-10-09 100
10 2020-10-10 100
11 2020-10-11 100
12 2020-10-12 100
13 2020-10-13 100
14 2020-10-14 100
15 2020-10-15 100
16 2020-10-16 100
17 2020-10-17 100
18 2020-10-18 100
19 2020-10-19 100
20 2020-10-20 100
21 2020-10-21 100
22 2020-10-22 100
23 2020-10-23 100
24 2020-10-24 100
25 2020-10-25 100
26 2020-10-26 100
27 2020-10-27 100
28 2020-10-28 100
29 2020-10-29 100
30 2020-10-30 300
31 2020-10-31 300
32 2020-11-01 300
33 2020-11-02 300
34 2020-11-03 300
35 2020-11-04 600
36 2020-11-05 400
37 2020-11-06 400
38 2020-11-07 400
39 2020-11-08 400
40 2020-11-09 400
41 2020-11-10 400
42 2020-11-11 400
43 2020-11-12 400
44 2020-11-13 400
45 2020-11-14 400
46 2020-11-15 400
47 2020-11-16 400
48 2020-11-17 400
49 2020-11-18 400
50 2020-11-19 400
51 2020-11-20 400
52 2020-11-21 400
53 2020-11-22 400
54 2020-11-23 400
55 2020-11-24 400
56 2020-11-25 400
57 2020-11-26 400
58 2020-11-27 400
59 2020-11-28 400
60 2020-11-29 400
61 2020-11-30 400
62 2020-12-01 400
63 2020-12-02 400
64 2020-12-03 400
65 2020-12-04 400
66 2020-12-05 400
67 2020-12-06 400
68 2020-12-07 400
69 2020-12-08 400
70 2020-12-09 400
71 2020-12-10 500
72 2020-12-11 500
73 2020-12-12 500
74 2020-12-13 500
75 2020-12-14 1000
76 2020-12-15 1000
77 2020-12-16 1000
78 2020-12-17 1000
79 2020-12-18 1000
80 2020-12-19 1000
81 2020-12-20 1000
82 2020-12-21 1000
Basically, I sum the value of open deals for each date. However, this appears to me as a very inefficient approach; the real data set is quite extensive, and evaluating at every date takes quite some time.
I am only really interested in the expression being evaluated for each unique entry in added and closed as described above. Are there any elegant solutions using dplyr which accomplish this, (preferably) without creating a new object? The idea is to later plot it and send it to a shiny app, so a reduction in computing time could be crucial.
Thanks in advance!
EDIT
To show an example for two dates 2020-11-04 and 2020-11-05:
On 2020-11-04, three deals (entries 1-3 in the data) are neither Lost nor Won, hence the pipeline value for that date is 600.
On 2020-11-05, one deal is now Lost (second entry in the data), and the pipeline value drops to 400.
So, when a deal is closed, the pipeline value decreases, and when a deal is added, the pipeline value increases.
Values for all dates are added to the original question.
This event-based approach will also be useful: each added date contributes +value, each closed date contributes -value, and a running cumsum then gives the pipeline value at each event date.
result <- dff %>%
  mutate(id = row_number(),
         added = as.Date(added),
         closed = as.Date(closed)) %>%
  pivot_longer(cols = c("added", "closed"), names_to = "activity", values_to = "dates") %>%
  mutate(activity = factor(activity, levels = c("added", "closed"), ordered = TRUE)) %>%
  arrange(dates, activity) %>%
  mutate(val = cumsum(value * case_when(activity == "added" ~ 1,
                                        activity == "closed" ~ -1,
                                        TRUE ~ 0))) %>%
  group_by(dates) %>%
  summarise(val = min(val))
# A tibble: 7 x 2
dates val
<date> <dbl>
1 2020-10-05 100
2 2020-10-30 300
3 2020-11-04 600
4 2020-11-05 400
5 2020-12-10 500
6 2020-12-14 1000
7 NA 0
dput(dff) used
> dput(dff)
structure(list(added = structure(c(18540, 18565, 18570, 18606,
18610), class = "Date"), closed = structure(c(NA, 18571, 18606,
NA, NA), class = "Date"), value = c(100, 200, 300, 400, 500),
stage = c("Quote", "Won", "Lost", "Quote", "Quote")), row.names = c(NA,
-5L), class = "data.frame")
If you want to plot the results, drop the NA row (which comes from deals that never closed) and then use complete and fill:
plot <- result %>% filter(!is.na(dates)) %>%
complete(dates = seq.Date(min(dates), max(dates), by = "day")) %>%
fill(val) %>%
ggplot() +
geom_line(aes(x = dates, y = val))
plot
I don't get why you're using closed < Sys.Date(). Do you have future closed deals? If so, this code will account for them:
df %>%
  mutate(across(closed:added, lubridate::ymd), cl = closed < Sys.Date()) %>%
  pivot_longer(c(closed, added), values_to = "date") %>%
  filter(!is.na(date)) %>%
  group_by(date) %>%
  summarise(pipeline = sum((stage == "Quote" | cl) * value)) %>%
  mutate(pipeline = cumsum(pipeline))
# A tibble: 7 x 2
date pipeline
<date> <dbl>
1 2020-10-05 100
2 2020-11-05 300
3 2020-10-30 500
4 2020-12-10 1200
5 2020-11-04 1500
6 2020-12-10 2200
7 2020-12-14 2700
I would appreciate your help in calculating the number of days since the last purchase per user ID. I attached the dataset with the expected target.
Thank you.
We can group by 'USERID' and take the difftime between each 'date' and the previous one, after converting the column to Datetime class.
library(lubridate)
library(dplyr)
df1 %>%
  mutate(date = mdy_hm(date)) %>%  # convert to Datetime class
  group_by(USERID) %>%             # group by USERID
  mutate(numberofdays = as.integer(
    difftime(date, lag(date, default = first(date)), units = 'days')  # days since previous purchase
  ))
# A tibble: 8 x 5
# Groups: USERID [3]
# ID date USERID SALES numberofdays
# <int> <dttm> <dbl> <dbl> <int>
#1 1 2018-11-19 10:36:00 500 1000 0
#2 2 2018-11-19 10:41:00 520 1450 0
#3 3 2018-11-23 10:59:00 500 1390 4
#4 4 2018-11-23 11:12:00 530 1778 0
#5 5 2018-11-29 11:52:00 530 1966 6
#6 6 2018-12-05 12:23:00 520 1100 16
#7 7 2018-12-19 12:24:00 520 700 14
#8 8 2018-12-25 21:24:00 520 900 6
data
df1 <- structure(list(ID = 1:8, date = c("11/19/2018 10:36", "11/19/2018 10:41",
"11/23/2018 10:59", "11/23/2018 11:12", "11/29/2018 11:52", "12/5/2018 12:23",
"12/19/2018 12:24", "12/25/2018 21:24"), USERID = c(500, 520,
500, 530, 530, 520, 520, 520), SALES = c(1000, 1450, 1390, 1778,
1966, 1100, 700, 900)), class = "data.frame", row.names = c(NA,
-8L))
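For comparison, a base R sketch of the same per-user difference using ave (assumes the df1 above; 86400 is the number of seconds in a day):
df1$dt <- as.POSIXct(df1$date, format = "%m/%d/%Y %H:%M", tz = "UTC")
df1$numberofdays <- ave(
  as.numeric(df1$dt), df1$USERID,
  FUN = function(x) floor(c(0, diff(x)) / 86400)  # whole days since the previous purchase
)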
I have the below-mentioned dataframe in R.
DF
ID Datetime Value
T-1 2020-01-01 15:12:14 10
T-2 2020-01-01 00:12:10 20
T-3 2020-01-01 03:11:11 25
T-4 2020-01-01 14:01:01 20
T-5 2020-01-01 18:07:11 10
T-6 2020-01-01 20:10:09 15
T-7 2020-01-01 15:45:23 15
Using the above dataframe, I want to break the count and sum down by month and by time-of-day bucket, based on the Datetime column.
Required Output:
Month Count Sum
Jan-20 7 115
12:00 AM to 05:00 AM 2 45
06:00 AM to 12:00 PM 0 0
12:00 PM to 03:00 PM 1 20
03:00 PM to 08:00 PM 3 35
08:00 PM to 12:00 AM 1 15
You can bin the hours of the day by using hour from the lubridate package and then cut from base R, before summarizing with dplyr.
Here, I am assuming that your Datetime column is actually in a date-time format and not just a character string or factor. If it is, ensure you have done DF$Datetime <- as.POSIXct(as.character(DF$Datetime)) first to convert it.
library(tidyverse)
DF$bins <- cut(lubridate::hour(DF$Datetime), c(-1, 5.99, 11.99, 14.99, 19.99, 24))
levels(DF$bins) <- c("00:00 to 05:59", "06:00 to 11:59", "12:00 to 14:59",
"15:00 to 19:59", "20:00 to 23:59")
newDF <- DF %>%
group_by(bins, .drop = FALSE) %>%
summarise(Count = length(Value), Total = sum(Value))
This gives the following result (.drop = FALSE keeps the empty 06:00 to 11:59 bin in the output):
newDF
#> # A tibble: 5 x 3
#> bins Count Total
#> <fct> <int> <dbl>
#> 1 00:00 to 05:59 2 45
#> 2 06:00 to 11:59 0 0
#> 3 12:00 to 14:59 1 20
#> 4 15:00 to 19:59 3 35
#> 5 20:00 to 23:59 1 15
And if you want to add January as a first row (though I'm not sure how much sense this makes in this context) you could do:
newDF %>%
summarise(bins = "January", Count = sum(Count), Total = sum(Total)) %>% bind_rows(newDF)
#> # A tibble: 6 x 3
#> bins Count Total
#> <chr> <int> <dbl>
#> 1 January 7 115
#> 2 00:00 to 05:59 2 45
#> 3 06:00 to 11:59 0 0
#> 4 12:00 to 14:59 1 20
#> 5 15:00 to 19:59 3 35
#> 6 20:00 to 23:59 1 15
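If the data spanned several months, a hedged generalization (a sketch reusing the bins column created above) is to group by month as well:
DF %>%
  mutate(Month = format(Datetime, "%b-%y")) %>%   # e.g. "Jan-20"
  group_by(Month, bins, .drop = FALSE) %>%
  summarise(Count = length(Value), Total = sum(Value), .groups = "drop")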
Incidentally, the reproducible version of the data I used for this was:
structure(list(ID = structure(1:7, .Label = c("T-1", "T-2", "T-3",
"T-4", "T-5", "T-6", "T-7"), class = "factor"), Datetime = structure(c(1577891534,
1577837530, 1577848271, 1577887261, 1577902031, 1577909409, 1577893523
), class = c("POSIXct", "POSIXt"), tzone = ""), Value = c(10,
20, 25, 20, 10, 15, 15)), class = "data.frame", row.names = c(NA,
-7L))