Complicated data formation - R

So I am trying to make a separate dataset that combines the yearly absence percentage with a binary variable flagging those with 10% or more total absence in a year.
The absencePercentage should be calculated by dividing the total unauthorised and authorised absence by the total possible sessions across all three terms.
Another thing is VioFlag: if the person has been flagged for Vio in at least one of the terms, they should be flagged as VioFlagEver.
So the original data is like this:
ID PossibleSessions Term year unauthorisedAbsence authorisedAbsence VioFlag
0110 46 Sum 2014 0 1 0
0110 116 Win 2014 1 8 1
0110 56 Spr 2014 0 5 0
0110 44 Sum 2015 21 9 0
0110 120 Win 2015 2 2 0
0110 58 Spr 2015 10 1 0
So for ID 0110, he was absent for 15 sessions in 2014 (0+1+1+8+0+5 = 15) out of a possible 218 sessions (46+116+56 = 218). This means his absence percentage in 2014 is 6.88%, so he will not be a frequent absentee that year. But because his 2015 absence rate was 20.27%, he will be a frequent absentee in 2015.
For ID 0110, he will be VioFlagEver for 2014 but not for 2015.
The new dataset I want to create is this.
ID year absencePercentage FrenquentAbsentee VioFlagEver
0110 2014 6.88 0 1
0110 2015 20.27 1 0
Please note that there are many IDs and year 2014 to 2018.
Thank you for your help!

You can try this:
library(tidyverse)
df %>%
  group_by(ID, year) %>%
  summarize(absencePercentage = (sum(unauthorisedAbsence) + sum(authorisedAbsence)) / sum(PossibleSessions) * 100,
            VioFlagEver = if_else(sum(VioFlag) > 0, 1, 0),
            FrenquentAbsentee = if_else(absencePercentage > 10, 1, 0))
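For comparison, the same ID-by-year aggregation can be sketched in base R with aggregate(); this is a sketch that re-enters the example rows from the question as a data.frame rather than assuming `df` already exists.

```r
# Example data from the question, one row per ID/term/year
df <- data.frame(
  ID = "0110",
  PossibleSessions = c(46, 116, 56, 44, 120, 58),
  year = rep(c(2014, 2015), each = 3),
  unauthorisedAbsence = c(0, 1, 0, 21, 2, 10),
  authorisedAbsence = c(1, 8, 5, 9, 2, 1),
  VioFlag = c(0, 1, 0, 0, 0, 0)
)

# Sum the term-level columns within each ID-year pair
agg <- aggregate(
  cbind(unauthorisedAbsence, authorisedAbsence, PossibleSessions, VioFlag) ~ ID + year,
  data = df, FUN = sum
)

# Derive the yearly percentage and the two binary flags
agg$absencePercentage <- with(agg,
  (unauthorisedAbsence + authorisedAbsence) / PossibleSessions * 100)
agg$FrenquentAbsentee <- as.integer(agg$absencePercentage > 10)
agg$VioFlagEver <- as.integer(agg$VioFlag > 0)
```

This reproduces the worked example: 6.88% (not a frequent absentee) for 2014 and 20.27% (frequent absentee) for 2015.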

You can use tidyverse (dplyr) group_by and summarize to achieve this
library(tidyverse)
read.table(textConnection("ID PossibleSessions Term year unauthorisedAbsence authorisedAbsence VioFlag
0110 46 Sum 2014 0 1 0
0110 116 Win 2014 1 8 1
0110 56 Spr 2014 0 5 0
0110 44 Sum 2015 21 9 0
0110 120 Win 2015 2 2 0
0110 58 Spr 2015 10 1 0"),
header = T) %>%
as_tibble() -> df
df %>%
  mutate(totalAbscence = unauthorisedAbsence + authorisedAbsence) %>%
  group_by(ID, year) %>%
  summarise(possibleAbscence = sum(PossibleSessions),
            totalAbscence = sum(totalAbscence),
            VioFlagEver = sum(VioFlag)) %>%
  mutate(absencePercentage = (totalAbscence / possibleAbscence) * 100,
         FrenquentAbsentee = if_else(absencePercentage > 10, 1, 0),
         VioFlagEver = if_else(VioFlagEver > 0, 1, 0))
#> `summarise()` regrouping output by 'ID' (override with `.groups` argument)
#> # A tibble: 2 x 7
#> # Groups: ID [1]
#> ID year possibleAbscence totalAbscence VioFlagEver absencePercenta…
#> <int> <int> <int> <int> <dbl> <dbl>
#> 1 110 2014 218 15 1 6.88
#> 2 110 2015 222 45 0 20.3
#> # … with 1 more variable: FrenquentAbsentee <dbl>
Created on 2021-01-27 by the reprex package (v0.3.0)

Related

how best to calculate this share of a total

Below is the sample data. The goal is to first create a column that contains the total employment for that quarter. Second is to create a new column that shows the relative share for the area. Finally, the last item (and the one which is vexing me) is to calculate whether the total with suppress = 0 represents over 50% of the total. I can do this in Excel easily, but I am trying to do this in R so I have something I can replicate year after year.
desired result is below
area <- c("001","005","007","009","011","013","015","017","019","021","023","027","033","001","005","007","009","011","013","015","017","019","021","023","027","033")
year <- c("2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021")
qtr <- c("01","01","01","01","01","01","01","01","01","01","01","01","01","02","02","02","02","02","02","02","02","02","02","02","02","02")
employment <- c(2,4,6,8,11,10,12,14,16,18,20,22,30,3,5,8,9,12,9,24,44,33,298,21,26,45)
suppress <- c(0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0)
testitem <- data.frame(year,qtr, area, employment, suppress)
For the first quarter of 2021, the total is 173. If you only take suppress = 1 into account, that is only 24 of 173, hence the TRUE in the 50percent column. If the suppress = 1 values summed to 173/2 or greater, it would say FALSE. For the second quarter, suppress = 1 accounts for 310 of the total of 537 and so is over 50% of the total.
For the total column, I am showing the computation or ingredients. Ideally, it would show a value such as .0116 in place of 2/173.
year qtr area employment suppress total 50percent
2021 01 001 2 0 =2/173 TRUE
2021 01 005 4 0 =4/173 TRUE
.....
2021 02 001 3 0 =3/537 FALSE
2021 02 005 5 0 =5/537 FALSE
For example:
library(dplyr)
testitem %>%
  group_by(year, qtr) %>%
  mutate(
    total = employment / sum(employment),
    over_half = sum(employment[suppress == 0]) > (0.5 * sum(employment))
  )
Gives:
# A tibble: 26 × 7
# Groups: year, qtr [2]
year qtr area employment suppress total over_half
<chr> <chr> <chr> <dbl> <dbl> <dbl> <lgl>
1 2021 01 001 2 0 0.0116 TRUE
2 2021 01 005 4 0 0.0231 TRUE
3 2021 01 007 6 0 0.0347 TRUE
4 2021 01 009 8 1 0.0462 TRUE
5 2021 01 011 11 0 0.0636 TRUE
6 2021 01 013 10 0 0.0578 TRUE
7 2021 01 015 12 0 0.0694 TRUE
8 2021 01 017 14 0 0.0809 TRUE
9 2021 01 019 16 1 0.0925 TRUE
10 2021 01 021 18 0 0.104 TRUE
# … with 16 more rows
# ℹ Use `print(n = ...)` to see more rows
I think you'll want to use group_by() and mutate() here.
library(dplyr)

testitem |>
  ## grouping by year and quarter;
  ## sums will be calculated over areas
  group_by(year, qtr) |>
  ## this could be more terse, but gets the job done
  mutate(total_sum = sum(employment),
         ## this uses the total_sum column that was just created
         total_prop = employment / total_sum,
         ## leveraging the 0/1 coding of suppress
         suppress_sum = sum(suppress * employment),
         suppress_prop = suppress_sum / total_sum,
         fifty = (1 - suppress_prop) > 0.5)
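As a quick sanity check on the numbers quoted in the question, the quarterly totals and suppressed amounts can be recomputed in base R from the sample vectors; this is a minimal sketch using tapply().

```r
# Sample data from the question, first 13 values are Q1, last 13 are Q2
employment <- c(2, 4, 6, 8, 11, 10, 12, 14, 16, 18, 20, 22, 30,
                3, 5, 8, 9, 12, 9, 24, 44, 33, 298, 21, 26, 45)
suppress <- c(0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
              0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0)
qtr <- rep(c("01", "02"), each = 13)

totals <- tapply(employment, qtr, sum)                 # quarterly totals
suppressed <- tapply(employment * suppress, qtr, sum)  # suppress = 1 portion
over_half <- (totals - suppressed) > totals / 2        # unsuppressed majority?
```

This confirms the figures in the question: totals of 173 and 537, suppressed amounts of 24 and 310, and an over-half flag of TRUE for Q1 and FALSE for Q2.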

how to sum conditional functions to grouped rows in R

So I have the following data frame:
customerid payment_month payment_date bill_month charges
1 January 22 January 30
1 February 15 February 21
1 March 2 March 33
1 May 4 April 43
1 May 4 May 23
1 June 13 June 32
2 January 12 January 45
2 February 15 February 56
2 March 2 March 67
2 April 4 April 65
2 May 4 May 54
2 June 13 June 68
3 January 25 January 45
3 February 26 February 56
3 March 30 March 67
3 April 1 April 65
3 June 1 May 54
3 June 1 June 68
(the id data is much larger) I want to calculate payment efficiency using the following function,
efficiency = (amount paid not late / total bill amount)*100
Not late means paying no later than the 21st day of the bill's month (paying January's bill on the 22nd of January counts as late).
I want to calculate the efficiency of each customer with the expected output of
customerid effectivity
1 59.90
2 100
3 37.46
I have tried using the following code to calculate it for one id, and it works, but I want to apply it to every id and summarize the result into one column (effectivity) with one row per ID. I have tried using group_by, aggregate, and ifelse functions but nothing works. What should I do?
df1 <- filter(df, (payment_month != bill_month & customerid == 1) |
                  (payment_month == bill_month & payment_date > 21 & customerid == 1))
df2 <- filter(df, customerid == 1)
x <- sum(df1$charges)  # late charges for customer 1
y <- sum(df2$charges)  # total charges for customer 1
100 - (x / y) * 100
An option using dplyr
library(dplyr)
df %>%
  group_by(customerid) %>%
  summarise(
    effectivity = sum(charges[payment_date <= 21 & payment_month == bill_month]) / sum(charges) * 100,
    .groups = "drop")
# A tibble: 3 x 2
#customerid effectivity
# <int> <dbl>
#1 1 59.9
#2 2 100
#3 3 37.5
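To verify those figures, here is a self-contained version of the same summarise() with the question's table entered directly as a tibble (values copied from the data frame above):

```r
library(dplyr)

# The question's data frame, 6 monthly rows per customer
df <- tibble(
  customerid = rep(1:3, each = 6),
  payment_month = c("January", "February", "March", "May", "May", "June",
                    "January", "February", "March", "April", "May", "June",
                    "January", "February", "March", "April", "June", "June"),
  payment_date = c(22, 15, 2, 4, 4, 13,
                   12, 15, 2, 4, 4, 13,
                   25, 26, 30, 1, 1, 1),
  bill_month = rep(c("January", "February", "March", "April", "May", "June"), 3),
  charges = c(30, 21, 33, 43, 23, 32,
              45, 56, 67, 65, 54, 68,
              45, 56, 67, 65, 54, 68)
)

# On-time charges (paid in the bill month, on or before the 21st) over total charges
res <- df %>%
  group_by(customerid) %>%
  summarise(
    effectivity = sum(charges[payment_date <= 21 & payment_month == bill_month]) /
      sum(charges) * 100,
    .groups = "drop")
```

This reproduces the expected output: roughly 59.9 for customer 1, 100 for customer 2, and 37.5 for customer 3.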
df %>%
  group_by(customerid) %>%
  mutate(totalperid = sum(charges),
         pay_month_number = match(payment_month, month.name),
         bill_month_number = match(bill_month, month.name),
         late = pay_month_number > bill_month_number |
           (pay_month_number == bill_month_number & payment_date > 21)) %>%
  summarise(efficiency = sum(charges[!late]) / first(totalperid) * 100)

Using dplyr mutate function to create new variable conditionally based on current row

I am working on creating conditional averages for a large data set that involves # of flu cases seen during the week for several years. The data is organized as such:
What I want to do is create a new column that tabulates that average number of cases for that same week in previous years. For instance, for the row where Week.Number is 1 and Flu.Year is 2017, I would like the new row to give the average count for any year with Week.Number==1 & Flu.Year<2017. Normally, I would use the case_when() function to conditionally tabulate something like this. For instance, when calculating the average weekly volume I used this code:
mutate(average = case_when(
  Flu.Year == 2016 ~ mean(chcc$count[chcc$Flu.Year == 2016]),
  Flu.Year == 2017 ~ mean(chcc$count[chcc$Flu.Year == 2017]),
  Flu.Year == 2018 ~ mean(chcc$count[chcc$Flu.Year == 2018]),
  Flu.Year == 2019 ~ mean(chcc$count[chcc$Flu.Year == 2019])
))
However, with four years of data * 52 weeks, that is a lot of conditions to spell out. Is there a way to code this elegantly in dplyr? The problem I keep running into is that I want to use values from the count column in other rows, conditioned on the current row's Week.Number and Flu.Year, and I am not sure how to accomplish that. Please let me know if there is further information / detail I can provide.
Thanks,
Steven
dat <- tibble(
  Flu.Year = rep(2016:2019, each = 52),
  Week.Number = rep(1:52, 4),
  count = sample(1000, size = 52*4, replace = TRUE)
)
It's bad form and, in some cases, an error to use $-indexing within dplyr verbs.
I think a better way to get that average field is to group_by(Flu.Year) and calculate it straight-up.
library(dplyr)
set.seed(42)
dat <- tibble(
Flu.Year = sample(2016:2020, size=100, replace=TRUE),
count = sample(1000, size=100, replace=TRUE)
)
dat %>%
  group_by(Flu.Year) %>%
  mutate(average = mean(count)) %>%
  # just to show a quick summary
  slice(1:3) %>%
  ungroup()
# # A tibble: 15 x 3
# Flu.Year count average
# <int> <int> <dbl>
# 1 2016 734 578.
# 2 2016 356 578.
# 3 2016 411 578.
# 4 2017 217 436.
# 5 2017 453 436.
# 6 2017 920 436.
# 7 2018 963 558
# 8 2018 609 558
# 9 2018 536 558
# 10 2019 943 543.
# 11 2019 740 543.
# 12 2019 536 543.
# 13 2020 627 494.
# 14 2020 218 494.
# 15 2020 389 494.
An alternative approach is to generate a summary table (just one row per year) and join it back in to the original data.
dat %>%
  group_by(Flu.Year) %>%
  summarize(average = mean(count))
# # A tibble: 5 x 2
# Flu.Year average
# <int> <dbl>
# 1 2016 578.
# 2 2017 436.
# 3 2018 558
# 4 2019 543.
# 5 2020 494.
dat %>%
  group_by(Flu.Year) %>%
  summarize(average = mean(count)) %>%
  full_join(dat, by = "Flu.Year")
# # A tibble: 100 x 3
# Flu.Year average count
# <int> <dbl> <int>
# 1 2016 578. 734
# 2 2016 578. 356
# 3 2016 578. 411
# 4 2016 578. 720
# 5 2016 578. 851
# 6 2016 578. 822
# 7 2016 578. 465
# 8 2016 578. 679
# 9 2016 578. 30
# 10 2016 578. 180
# # ... with 90 more rows
The result, after chat:
tibble(Flu.Year = rep(2016:2018, each = 3), Week.Number = rep(1:3, 3), count = 1:9) %>%
  arrange(Flu.Year, Week.Number) %>%
  group_by(Week.Number) %>%
  mutate(year_week.average = lag(cumsum(count) / seq_along(count)))
# # A tibble: 9 x 4
# # Groups: Week.Number [3]
# Flu.Year Week.Number count year_week.average
# <int> <int> <int> <dbl>
# 1 2016 1 1 NA
# 2 2016 2 2 NA
# 3 2016 3 3 NA
# 4 2017 1 4 1
# 5 2017 2 5 2
# 6 2017 3 6 3
# 7 2018 1 7 2.5
# 8 2018 2 8 3.5
# 9 2018 3 9 4.5
We can use aggregate from base R:
aggregate(count ~ Flu.Year, dat, FUN = mean)
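The lagged expanding mean per week can also be done in base R with ave(); this is a sketch on the same small example used in the chat result, assuming rows are sorted by year first.

```r
# Small example: three years, three weeks, counts 1..9
dat <- data.frame(Flu.Year = rep(2016:2018, each = 3),
                  Week.Number = rep(1:3, 3),
                  count = 1:9)
dat <- dat[order(dat$Flu.Year, dat$Week.Number), ]

# For each Week.Number, the running mean of earlier years' counts;
# the first year has no prior data, hence NA
dat$year_week.average <- ave(dat$count, dat$Week.Number,
                             FUN = function(x) c(NA, head(cumsum(x) / seq_along(x), -1)))
```

This matches the dplyr result above: NA for 2016, then 1/2/3 for 2017 and 2.5/3.5/4.5 for 2018.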

R dplyr: Calculating row values by using previous row value and a function

I am trying to create a function in R that will allow me to determine the date at which a product will be out of stock. I would like this function to be able to account for scheduled incoming orders and show a "running total" of units in stock. Below is a reproducible idea of what I have been able to do thus far.
library(tidyverse)
library(lubridate)
runrate <- 25
onHand <- tibble(date = Sys.Date(), OnHand = 2000)
ord_tbl <- tibble(date = c(ymd("2020-04-09"), ymd("2020-04-12"), ymd("2020-04-17")), onOrder = c(200, 500, 100))
date_tbl <- tibble(date = seq.Date(from = Sys.Date(), to = Sys.Date() + 180, by = "day")) %>%
  mutate(Month = month(date, label = TRUE))
joined_tbl <- date_tbl %>%
  left_join(onHand) %>%
  left_join(ord_tbl)
joined_tbl <- joined_tbl %>%
  mutate(OnHand = coalesce(OnHand, 0),
         onOrder = coalesce(onOrder, 0),
         id = row_number()) %>%
  mutate(usage = id * runrate) %>%
  select(id, everything())
start_inv_value <- joined_tbl %>%
  filter(date == Sys.Date()) %>%
  select(OnHand)
joined_tbl %>%
  mutate(projected_On_Hand = start_inv_value$OnHand - (id * usage) + onOrder)
Ideally, I would like to take the starting on-hand inventory, subtract the daily usage, and add in units expected to be received; however, I am unable to carry down the previous day's projected_On_Hand value.
The anticipated results would look like this:
Thank you for your help!
I think you might want to include a cumulative sum of onOrder (use cumsum). In addition, you can just subtract usage for each row.
joined_tbl %>%
  mutate(projected_On_Hand = start_inv_value$OnHand - usage + cumsum(onOrder))
Output
# A tibble: 181 x 7
id date Month OnHand onOrder usage projected_On_Hand
<int> <date> <ord> <dbl> <dbl> <dbl> <dbl>
1 1 2020-04-08 Apr 2000 0 25 1975
2 2 2020-04-09 Apr 0 200 50 2150
3 3 2020-04-10 Apr 0 0 75 2125
4 4 2020-04-11 Apr 0 0 100 2100
5 5 2020-04-12 Apr 0 500 125 2575
6 6 2020-04-13 Apr 0 0 150 2550
7 7 2020-04-14 Apr 0 0 175 2525
8 8 2020-04-15 Apr 0 0 200 2500
9 9 2020-04-16 Apr 0 0 225 2475
10 10 2020-04-17 Apr 0 100 250 2550
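With the running projected_On_Hand in place, the out-of-stock date is simply the first date the balance goes negative. Here is a self-contained sketch of that lookup; it uses a fixed start date rather than Sys.Date() (an assumption made so the result is reproducible), with the same 2000-unit starting stock, 25-unit daily run rate, and incoming orders as the question.

```r
library(dplyr)

start <- as.Date("2020-04-08")  # fixed start date, assumed for reproducibility
run_rate <- 25
orders <- tibble(date = as.Date(c("2020-04-09", "2020-04-12", "2020-04-17")),
                 onOrder = c(200, 500, 100))

# Daily ledger: cumulative usage down, cumulative receipts up
stock <- tibble(date = seq(start, start + 180, by = "day")) %>%
  left_join(orders, by = "date") %>%
  mutate(onOrder = coalesce(onOrder, 0),
         usage = row_number() * run_rate,
         projected_On_Hand = 2000 - usage + cumsum(onOrder))

# First date the projected balance dips below zero
out_of_stock <- stock %>%
  filter(projected_On_Hand < 0) %>%
  slice(1) %>%
  pull(date)
```

With 2000 on hand plus 800 on order against 25 units/day, the balance reaches zero on day 112 and first goes negative the next day.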

Percentile for multiple groups of values in R

I'm using R to do my data analysis.
I'm looking for the code to achieve the below mentioned output.
I need a single piece of code to do this as I have over 500 groups & 24 months in my actual data. The below sample has only 2 groups & 2 months.
This is a sample of my data.
Date Group Value
1-Jan-16 A 10
2-Jan-16 A 12
3-Jan-16 A 17
4-Jan-16 A 20
5-Jan-16 A 12
5-Jan-16 B 56
1-Jan-16 B 78
15-Jan-16 B 97
20-Jan-16 B 77
21-Jan-16 B 86
2-Feb-16 A 91
2-Feb-16 A 44
3-Feb-16 A 93
4-Feb-16 A 87
5-Feb-16 A 52
5-Feb-16 B 68
1-Feb-16 B 45
15-Feb-16 B 100
20-Feb-16 B 81
21-Feb-16 B 74
And this is the output I'm looking for.
Month Year Group Minimum Value 5th Percentile 10th Percentile 50th Percentile 90th Percentile Max Value
Jan 2016 A
Jan 2016 B
Feb 2016 A
Feb 2016 B
Considering dft as your input, you can try:
library(dplyr)
library(lubridate)

dft %>%
  mutate(Date = as.Date(Date, format = "%d-%b-%y")) %>%
  mutate(mon = month(Date),
         yr = year(Date)) %>%
  group_by(mon, yr, Group) %>%
  mutate(minimum = min(Value),
         maximum = max(Value),
         q95 = quantile(Value, 0.95)) %>%
  select(minimum, maximum, q95) %>%
  unique()
which gives:
mon yr Group minimum maximum q95
<int> <int> <chr> <int> <int> <dbl>
1 1 2016 A 10 20 19.4
2 1 2016 B 56 97 94.8
3 2 2016 A 44 93 92.6
4 2 2016 B 45 100 96.2
and add more variables as per your need.
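Since the desired output asks specifically for the 5th, 10th, 50th, and 90th percentiles, those can be added as one quantile() call per column inside summarise(). A sketch on the January Group-A rows from the sample (the column names p05/p10/p50/p90 are illustrative, not from the question):

```r
library(dplyr)
library(lubridate)

# A small slice of the sample data: January 2016, Group A
dft <- tibble(Date = c("1-Jan-16", "2-Jan-16", "3-Jan-16", "4-Jan-16", "5-Jan-16"),
              Group = "A",
              Value = c(10, 12, 17, 20, 12))

# One row per month/year/group, with the requested order statistics
res <- dft %>%
  mutate(Date = as.Date(Date, format = "%d-%b-%y"),
         mon = month(Date, label = TRUE),
         yr = year(Date)) %>%
  group_by(mon, yr, Group) %>%
  summarise(Minimum = min(Value),
            p05 = quantile(Value, 0.05),
            p10 = quantile(Value, 0.10),
            p50 = quantile(Value, 0.50),
            p90 = quantile(Value, 0.90),
            Maximum = max(Value),
            .groups = "drop")
```

The grouping scales to the full data (500+ groups, 24 months) unchanged; only the input tibble differs.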