Below is the sample data. The goal is first to create a column that contains the total employment for that quarter. Second is to create a new column that shows the relative share for each area. Finally, the last item (and the one that is vexing me) is to calculate whether the rows with suppress = 0 represent over 50% of the total. I can do this easily in Excel, but I am trying to do it in R so that I can replicate it year after year.
The desired result is below.
area <- c("001","005","007","009","011","013","015","017","019","021","023","027","033","001","005","007","009","011","013","015","017","019","021","023","027","033")
year <- c("2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021")
qtr <- c("01","01","01","01","01","01","01","01","01","01","01","01","01","02","02","02","02","02","02","02","02","02","02","02","02","02")
employment <- c(2,4,6,8,11,10,12,14,16,18,20,22,30,3,5,8,9,12,9,24,44,33,298,21,26,45)
suppress <- c(0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0)
testitem <- data.frame(year,qtr, area, employment, suppress)
For the first quarter of 2021, the total is 173. The rows with suppress = 1 account for only 24 of 173, hence the TRUE in the 50percent column. If those suppressed values summed to 173/2 or more, then the column should say FALSE. For the second quarter, the suppress = 1 rows account for 310 of the total of 537, which is over 50% of the total, hence FALSE.
For the total column, I am showing the computation or ingredients. Ideally, it would show a value such as 0.0116 in place of 2/173.
year  qtr  area  employment  suppress  total   50percent
2021  01   001   2           0         =2/173  TRUE
2021  01   005   4           0         =4/173  TRUE
.....
2021  02   001   3           0         =3/537  FALSE
2021  02   005   5           0         =5/537  FALSE
For example:
library(dplyr)
testitem %>%
  group_by(year, qtr) %>%
  mutate(
    total = employment / sum(employment),
    over_half = sum(employment[suppress == 0]) > (0.5 * sum(employment))
  )
Gives:
# A tibble: 26 × 7
# Groups: year, qtr [2]
year qtr area employment suppress total over_half
<chr> <chr> <chr> <dbl> <dbl> <dbl> <lgl>
1 2021 01 001 2 0 0.0116 TRUE
2 2021 01 005 4 0 0.0231 TRUE
3 2021 01 007 6 0 0.0347 TRUE
4 2021 01 009 8 1 0.0462 TRUE
5 2021 01 011 11 0 0.0636 TRUE
6 2021 01 013 10 0 0.0578 TRUE
7 2021 01 015 12 0 0.0694 TRUE
8 2021 01 017 14 0 0.0809 TRUE
9 2021 01 019 16 1 0.0925 TRUE
10 2021 01 021 18 0 0.104 TRUE
# … with 16 more rows
# ℹ Use `print(n = ...)` to see more rows
I think you'll want to use group_by() and mutate() here.
library(dplyr)
testitem |>
  ## grouping by year and quarter
  ## sums will be calculated over areas
  group_by(year, qtr) |>
  ## this could be more terse, but gets the job done.
  mutate(total_sum = sum(employment),
         ## This uses the total_sum column that was just created
         total_prop = employment / total_sum,
         ## leveraging the 0,1 coding of suppress
         suppress_sum = sum(suppress * employment),
         suppress_prop = suppress_sum / total_sum,
         fifty = (1 - suppress_prop) > 0.5)
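As a quick check on the arithmetic described in the question, here is a minimal base R sketch (so it runs without dplyr) that recomputes the per-quarter totals and the 50% flag from the sample vectors. The names `total_by_qtr`, `suppressed_by_qtr`, `over_half`, and `share` are made up for this illustration; it is not meant to replace the grouped `mutate()` approaches above.

```r
# Sample data from the question (qtr, employment, suppress only)
qtr <- rep(c("01", "02"), each = 13)
employment <- c(2, 4, 6, 8, 11, 10, 12, 14, 16, 18, 20, 22, 30,
                3, 5, 8, 9, 12, 9, 24, 44, 33, 298, 21, 26, 45)
suppress <- c(0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
              0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0)

# Per-quarter totals: all rows, and only the suppressed rows
total_by_qtr <- tapply(employment, qtr, sum)
suppressed_by_qtr <- tapply(employment * suppress, qtr, sum)

# TRUE when the unsuppressed rows hold more than half of the quarter's total
over_half <- (total_by_qtr - suppressed_by_qtr) > 0.5 * total_by_qtr

# Row-level share of the quarter total (e.g. 2/173 for the first row)
share <- employment / ave(employment, qtr, FUN = sum)

print(total_by_qtr)
print(suppressed_by_qtr)
print(over_half)
```

The totals come out as 173 and 537, with 24 and 310 suppressed, which matches the figures quoted in the question.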
I have the following data frame:
customerid  payment_month  payment_date  bill_month  charges
1           January        22            January     30
1           February       15            February    21
1           March          2             March       33
1           May            4             April       43
1           May            4             May         23
1           June           13            June        32
2           January        12            January     45
2           February       15            February    56
2           March          2             March       67
2           April          4             April       65
2           May            4             May         54
2           June           13            June        68
3           January        25            January     45
3           February       26            February    56
3           March          30            March       67
3           April          1             April       65
3           June           1             May         54
3           June           1             June        68
(The actual data has many more customer ids.) I want to calculate payment efficiency using the following formula:
efficiency = (amount paid not late / total bill amount)*100
not late is paying no later than the 21st day of the bill's month. (paying January's bill on the 22nd of January is considered as late)
I want to calculate the efficiency of each customer with the expected output of
customerid  effectivity
1           59.90
2           100
3           37.46
I have tried the following code to calculate this for one id, and it works. But I want to apply it to every id and summarize the result into one column (effectivity) with one row per id. I have tried using group by, aggregate and ifelse functions, but nothing works. What should I do?
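# charges for customer 1 that were paid late
df1 <- filter(df, (payment_month != bill_month & customerid == 1) | (payment_month == bill_month & payment_date > 21 & customerid == 1))
# all charges for customer 1
df2 <- filter(df, customerid == 1)
x <- sum(df1$charges)
y <- sum(df2$charges)
100 - (x / y) * 100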
An option using dplyr
library(dplyr)
df %>%
  group_by(customerid) %>%
  summarise(
    effectivity = sum(
      charges[payment_date <= 21 & payment_month == bill_month]) / sum(charges) * 100,
    .groups = "drop")
## A tibble: 3 x 2
#customerid effectivity
# <int> <dbl>
#1 1 59.9
#2 2 100
#3 3 37.5
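## Another option: convert month names to numbers so they can be compared
## (this assumes all rows fall within the same year)
df %>%
  mutate(pay_month_number = match(payment_month, month.name),
         bill_month_number = match(bill_month, month.name)) %>%
  ## not late: paid before the bill's month, or in it on or before the 21st
  mutate(nolate = pay_month_number < bill_month_number |
           (pay_month_number == bill_month_number & payment_date <= 21)) %>%
  group_by(customerid) %>%
  summarise(efficiency = sum(charges[nolate]) / sum(charges) * 100)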
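The answers above assume a data frame `df` already exists. As a self-contained check, the sketch below rebuilds the sample table from the question and computes the efficiency per customer in base R, so it runs without dplyr. The names `on_time`, `paid_on_time`, and `billed` are made up for this illustration, and "on time" is taken to mean paid in the bill's own month on or before the 21st, as in the first answer.

```r
# Sample data transcribed from the table in the question
df <- data.frame(
  customerid    = rep(1:3, each = 6),
  payment_month = c("January", "February", "March", "May", "May", "June",
                    "January", "February", "March", "April", "May", "June",
                    "January", "February", "March", "April", "June", "June"),
  payment_date  = c(22, 15, 2, 4, 4, 13,
                    12, 15, 2, 4, 4, 13,
                    25, 26, 30, 1, 1, 1),
  bill_month    = rep(c("January", "February", "March",
                        "April", "May", "June"), times = 3),
  charges       = c(30, 21, 33, 43, 23, 32,
                    45, 56, 67, 65, 54, 68,
                    45, 56, 67, 65, 54, 68)
)

# TRUE for rows paid in the bill's month, on or before the 21st
on_time <- df$payment_month == df$bill_month & df$payment_date <= 21

# Sum on-time charges and all charges per customer, then take the ratio
paid_on_time <- tapply(df$charges * on_time, df$customerid, sum)
billed <- tapply(df$charges, df$customerid, sum)
effectivity <- round(100 * paid_on_time / billed, 2)

print(effectivity)
```

For customer 1 this is 109 on-time out of 182 billed, which agrees with the expected output in the question up to rounding.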
I am working on creating conditional averages for a large data set that involves # of flu cases seen during the week for several years. The data is organized as such:
What I want to do is create a new column that holds the average number of cases for that same week in previous years. For instance, for the row where Week.Number is 1 and Flu.Year is 2017, I would like the new column to give the average count for any year with Week.Number==1 & Flu.Year<2017. Normally, I would use the case_when() function to conditionally tabulate something like this. For instance, when calculating the average weekly volume I used this code:
mutate(average = case_when(
  Flu.Year==2016 ~ mean(chcc$count[chcc$Flu.Year==2016]),
  Flu.Year==2017 ~ mean(chcc$count[chcc$Flu.Year==2017]),
  Flu.Year==2018 ~ mean(chcc$count[chcc$Flu.Year==2018]),
  Flu.Year==2019 ~ mean(chcc$count[chcc$Flu.Year==2019]),
),
However, with four years of data and 52 weeks each, that is a lot of conditions to spell out. Is there a way to code this elegantly in dplyr? The problem I keep running into is that I want to pull values from the count column in other rows, selected by their Week.Number and Flu.Year, conditioned on the current row's Week.Number and Flu.Year, and I am not sure how to accomplish that. Please let me know if there is further information or detail I can provide.
Thanks,
Steven
dat <- tibble( Flu.Year = rep(2016:2019,each = 52), Week.Number = rep(1:52,4), count = sample(1000, size=52*4, replace=TRUE) )
It's bad form and, in some cases, an error to use $-indexing within dplyr verbs.
I think a better way to get that average field is to group_by(Flu.Year) and calculate it straight-up.
library(dplyr)
set.seed(42)
dat <- tibble(
  Flu.Year = sample(2016:2020, size=100, replace=TRUE),
  count = sample(1000, size=100, replace=TRUE)
)
dat %>%
  group_by(Flu.Year) %>%
  mutate(average = mean(count)) %>%
  # just to show a quick summary
  slice(1:3) %>%
  ungroup()
# # A tibble: 15 x 3
# Flu.Year count average
# <int> <int> <dbl>
# 1 2016 734 578.
# 2 2016 356 578.
# 3 2016 411 578.
# 4 2017 217 436.
# 5 2017 453 436.
# 6 2017 920 436.
# 7 2018 963 558
# 8 2018 609 558
# 9 2018 536 558
# 10 2019 943 543.
# 11 2019 740 543.
# 12 2019 536 543.
# 13 2020 627 494.
# 14 2020 218 494.
# 15 2020 389 494.
An alternative approach is to generate a summary table (just one row per year) and join it back in to the original data.
dat %>%
  group_by(Flu.Year) %>%
  summarize(average = mean(count))
# # A tibble: 5 x 2
# Flu.Year average
# <int> <dbl>
# 1 2016 578.
# 2 2017 436.
# 3 2018 558
# 4 2019 543.
# 5 2020 494.
dat %>%
  group_by(Flu.Year) %>%
  summarize(average = mean(count)) %>%
  full_join(dat, by = "Flu.Year")
# # A tibble: 100 x 3
# Flu.Year average count
# <int> <dbl> <int>
# 1 2016 578. 734
# 2 2016 578. 356
# 3 2016 578. 411
# 4 2016 578. 720
# 5 2016 578. 851
# 6 2016 578. 822
# 7 2016 578. 465
# 8 2016 578. 679
# 9 2016 578. 30
# 10 2016 578. 180
# # ... with 90 more rows
The result, after chat:
tibble( Flu.Year = rep(2016:2018,each = 3), Week.Number = rep(1:3,3), count = 1:9 ) %>%
  arrange(Flu.Year, Week.Number) %>%
  group_by(Week.Number) %>%
  mutate(year_week.average = lag(cumsum(count) / seq_along(count)))
# # A tibble: 9 x 4
# # Groups: Week.Number [3]
# Flu.Year Week.Number count year_week.average
# <int> <int> <int> <dbl>
# 1 2016 1 1 NA
# 2 2016 2 2 NA
# 3 2016 3 3 NA
# 4 2017 1 4 1
# 5 2017 2 5 2
# 6 2017 3 6 3
# 7 2018 1 7 2.5
# 8 2018 2 8 3.5
# 9 2018 3 9 4.5
We can use aggregate from base R
aggregate(count ~ Flu.Year, dat, FUN = mean)
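For the original goal (the average of the same week across previous years only), a base R sketch along the lines of the lagged cumulative mean is shown here. It uses the small 3 x 3 example from above; `prior_mean` and `prior_week_average` are names invented for this illustration, and the approach assumes the rows are sorted by year within each week.

```r
# Small example: three years, three weeks, counts 1..9
dat <- data.frame(
  Flu.Year    = rep(2016:2018, each = 3),
  Week.Number = rep(1:3, times = 3),
  count       = 1:9
)

# Sort so that, within each week, rows run from the earliest year to the latest
dat <- dat[order(dat$Flu.Year, dat$Week.Number), ]

# Running mean of all earlier entries in x; NA for the first entry,
# because there is no previous year to average over
prior_mean <- function(x) c(NA, head(cumsum(x) / seq_along(x), -1))

# Apply the helper separately within each week number
dat$prior_week_average <- ave(as.numeric(dat$count), dat$Week.Number,
                              FUN = prior_mean)

print(dat)
```

For week 1 the counts are 1, 4, 7, so the 2017 row gets 1 and the 2018 row gets 2.5, the same values as in the dplyr output above.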
I am trying to create a function in R that will allow me to determine the date at which a product will be out of stock. I would like this function to be able to account for scheduled incoming orders and show a "running total" of units in stock. Below is a reproducible idea of what I have been able to do thus far.
library(tidyverse)
library(lubridate)
runrate <- 25
onHand <- tibble(date = Sys.Date(), OnHand = 2000)
ord_tbl <- tibble(date = c(ymd("2020-04-09"), ymd("2020-04-12"), ymd("2020-04-17")), onOrder = c(200, 500, 100))
date_tbl <- tibble(date = seq.Date(from = Sys.Date(), to = Sys.Date() + 180, by = "day")) %>%
  mutate(Month = month(date, label = TRUE))
joined_tbl <- date_tbl %>%
  left_join(onHand) %>%
  left_join(ord_tbl)
joined_tbl <- joined_tbl %>%
  mutate(OnHand = coalesce(joined_tbl$OnHand, 0),
         onOrder = coalesce(joined_tbl$onOrder, 0),
         id = row_number()) %>%
  mutate(usage = id * runrate) %>%
  select(id, everything())
start_inv_value <- joined_tbl %>%
  filter(date == Sys.Date()) %>%
  select(OnHand)
joined_tbl %>%
  mutate(projected_On_Hand = start_inv_value$OnHand - (id * usage) + onOrder)
Ideally, I would like to take the starting inventory on hand, then subtract the daily usage and add in the units that are expected to be received; however, I am unable to bring down the previous day's projected_On_Hand value.
The anticipated results would look like this:
Thank you for your help!
I think you might want to include a cumulative sum of onOrder (use cumsum). In addition, you can just subtract usage for each row.
joined_tbl %>%
  mutate(projected_On_Hand = start_inv_value$OnHand - usage + cumsum(onOrder))
Output
# A tibble: 181 x 7
id date Month OnHand onOrder usage projected_On_Hand
<int> <date> <ord> <dbl> <dbl> <dbl> <dbl>
1 1 2020-04-08 Apr 2000 0 25 1975
2 2 2020-04-09 Apr 0 200 50 2150
3 3 2020-04-10 Apr 0 0 75 2125
4 4 2020-04-11 Apr 0 0 100 2100
5 5 2020-04-12 Apr 0 500 125 2575
6 6 2020-04-13 Apr 0 0 150 2550
7 7 2020-04-14 Apr 0 0 175 2525
8 8 2020-04-15 Apr 0 0 200 2500
9 9 2020-04-16 Apr 0 0 225 2475
10 10 2020-04-17 Apr 0 100 250 2550
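To get from the running total to the out-of-stock date the question asks about, one possibility is to take the first date on which the projection reaches zero or below. The base R sketch below uses the same formula as above. It fixes the start date at 2020-04-08 (the first date in the output, standing in for `Sys.Date()`) so the result is reproducible; `stockout_date` and the other variable names are assumptions made for this illustration.

```r
start_date <- as.Date("2020-04-08")  # fixed stand-in for Sys.Date()
dates <- seq(start_date, by = "day", length.out = 181)
runrate <- 25
on_hand <- 2000

# Scheduled incoming orders, placed on their position in the date sequence
order_dates <- as.Date(c("2020-04-09", "2020-04-12", "2020-04-17"))
order_qty <- c(200, 500, 100)
on_order <- numeric(length(dates))
on_order[as.integer(order_dates - start_date) + 1L] <- order_qty

# Cumulative usage up to each day, and the running projection
usage <- seq_along(dates) * runrate
projected <- on_hand - usage + cumsum(on_order)

# First date on which the projection is zero or negative
# (this is NA if stock never runs out within the window)
stockout_date <- dates[which(projected <= 0)[1]]

print(head(projected, 10))
print(stockout_date)
```

With these inputs the stock plus all orders is 2800 units, so at 25 units a day the projection reaches zero on day 112 of the sequence.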
I'm using R to do my data analysis.
I'm looking for the code to achieve the below mentioned output.
I need a single piece of code to do this as I have over 500 groups & 24 months in my actual data. The below sample has only 2 groups & 2 months.
This is a sample of my data.
Date Group Value
1-Jan-16 A 10
2-Jan-16 A 12
3-Jan-16 A 17
4-Jan-16 A 20
5-Jan-16 A 12
5-Jan-16 B 56
1-Jan-16 B 78
15-Jan-16 B 97
20-Jan-16 B 77
21-Jan-16 B 86
2-Feb-16 A 91
2-Feb-16 A 44
3-Feb-16 A 93
4-Feb-16 A 87
5-Feb-16 A 52
5-Feb-16 B 68
1-Feb-16 B 45
15-Feb-16 B 100
20-Feb-16 B 81
21-Feb-16 B 74
And this is the output I'm looking for.
Month Year Group Minimum Value 5th Percentile 10th Percentile 50th Percentile 90th Percentile Max Value
Jan 2016 A
Jan 2016 B
Feb 2016 A
Feb 2016 B
Considering dft as your input, you can try:
library(dplyr)
library(lubridate)  # provides month() and year()
dft %>%
  mutate(Date = as.Date(Date, format = "%d-%b-%y")) %>%
  mutate(mon = month(Date),
         yr = year(Date)) %>%
  group_by(mon, yr, Group) %>%
  mutate(minimum = min(Value),
         maximum = max(Value),
         q95 = quantile(Value, 0.95)) %>%
  select(minimum, maximum, q95) %>%
  unique()
which gives:
mon yr Group minimum maximum q95
<int> <int> <chr> <int> <int> <dbl>
1 1 2016 A 10 20 19.4
2 1 2016 B 56 97 94.8
3 2 2016 A 44 93 92.6
4 2 2016 B 45 100 96.2
and add more variables as per your need.
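As one way to get all the requested percentiles at once, here is a base R sketch that passes a vector of probabilities to `quantile()` through `aggregate()`. It uses only the January rows of the sample data. The column names (`minimum`, `p05`, and so on) are made up for this illustration, and parsing with `%b` assumes an English locale for the month abbreviations.

```r
# January rows of the sample data
dft <- data.frame(
  Date  = c("1-Jan-16", "2-Jan-16", "3-Jan-16", "4-Jan-16", "5-Jan-16",
            "5-Jan-16", "1-Jan-16", "15-Jan-16", "20-Jan-16", "21-Jan-16"),
  Group = rep(c("A", "B"), each = 5),
  Value = c(10, 12, 17, 20, 12, 56, 78, 97, 77, 86)
)

# Parse the dates and build a year-month key to group on
d <- as.Date(dft$Date, format = "%d-%b-%y")
dft$Month <- format(d, "%Y-%m")

# Probabilities to report; 0 and 1 give the minimum and maximum
probs <- c(minimum = 0, p05 = 0.05, p10 = 0.10, p50 = 0.50, p90 = 0.90, maximum = 1)

# quantile() returns one value per probability, so Value becomes a matrix column
res <- aggregate(Value ~ Month + Group, data = dft,
                 FUN = quantile, probs = probs, names = FALSE)

# Flatten the matrix column into ordinary named columns
q <- res$Value
colnames(q) <- names(probs)
out <- cbind(res[c("Month", "Group")], q)

print(out)
```

`quantile()` uses its default interpolation (type 7) here, which is also what produces the 19.4 shown for group A in the answer above.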